The growth of social media over the last decade has revolutionized the way
individuals interact and industries conduct business. Individuals produce
data at an unprecedented rate by interacting, sharing, and consuming content
through social media. Understanding and processing this new type of
data to glean actionable patterns presents challenges and opportunities for 
interdisciplinary research, novel algorithms, and tool development. 

Social Media Mining integrates social media, social network analysis, 
and data mining to provide a convenient and coherent platform for students, 
practitioners, researchers, and project managers to understand the basics and 
potentials of social media mining. It introduces the unique problems arising 
from social media data and presents fundamental concepts, emerging issues, 
and effective algorithms for network analysis and data mining. 

Suitable for use in advanced undergraduate and beginning graduate 
courses as well as professional short courses, the text contains exercises 
of different degrees of difficulty that improve understanding and help apply 
concepts, principles, and methods in various scenarios of social media 
mining. 

Reza Zafarani is a research associate of Computer Science and Engineering 
at Arizona State University. He performs research in user behavioral
modeling and was among the first to conduct research on user identification and
behavioral analysis across sites. 

Mohammad Ali Abbasi is a research associate of Computer Science and 
Engineering at Arizona State University. His research is focused on
evaluating user credibility in social media and using social media for humanitarian
assistance and disaster relief. 

Huan Liu is a professor of Computer Science and Engineering at Arizona 
State University where he has been recognized for excellence in teaching 
and research. His research interests are in data mining, machine learning, 
social computing, artificial intelligence, and investigating problems that 
arise in real-world, data-intensive applications with high-dimensional data 
of disparate forms, such as social media. 
Preface


We live in an age of big data. With hundreds of millions of people spending 
countless hours on social media to share, communicate, connect, interact, 
and create user-generated data at an unprecedented rate, social media has 
become one unique source of big data. This novel source of rich data 
provides unparalleled opportunities and great potential for research and 
development. Unfortunately, more data does not necessarily beget more
good; only more of the right (or relevant) data enables us to glean gems.
Social media data differs from the traditional data we are familiar with in data
mining. Thus, new computational methods are needed to mine the data.
Social media data is noisy, free-format, of varying length, and multimedia.
Furthermore, social relations among the entities, or social networks, form
an inseparable part of social media data; hence, it is incumbent that social
theories and research methods be employed with statistical and data mining 
methods. It is therefore a propitious time for social media mining. 

Social media mining is a rapidly growing new field. It is an
interdisciplinary field at the crossroads of disparate disciplines deeply rooted in
computer science and social sciences. There are an active community and
a large body of literature about social media. The fast-growing interest
and intensifying need to harness social media data require research and
the development of tools for finding insights from big social media data.
This book is one of the intellectual efforts to answer the novel challenges
of social media. It is designed to enable students, researchers, and
practitioners to acquire fundamental concepts and algorithms for social media
mining.

Researchers in this emerging field are expected to have knowledge in 
different areas, such as data mining, machine learning, text mining, social 
network analysis, and information retrieval, and are often required to consult 
research papers to learn the state of the art of social media mining. To 
mitigate such a strenuous effort and help researchers get up to speed in a 
convenient way, we take advantage of our teaching and research of many
years to survey, summarize, filter, categorize, and connect disparate research 
findings and fundamental concepts of social media mining. This book is 
our diligent attempt to provide an easy reference or entry point to help 
researchers quickly acquire a wide range of essentials of social media 
mining. Social media not only produces big user-generated data; it also 
has a huge potential for social science research, business development, and 
understanding human and group behavior. If you want to share a piece 
of information or a site on social media, you would like to grab precious 
attention from other equally eager users of social media; if you are curious 
to know what is hidden or who is influential in the complex world of social 
media, you might wonder how one can find this information in big and 
messy social media; if you hope to serve your customers better in social 
media, you certainly want to employ effective means to understand them 
better. These are just some scenarios where social media mining can help. If 
one of these scenarios fits your need or you simply wish to learn something 
interesting in this emerging field of social media mining, this book is for 
you. We hope this book can be of benefit to you in accomplishing your 
goals of dealing with big data of social media. 

Book Website and Resources 

The book’s website and further resources can be found at 

http://dmml.asu.edu/smm 

The website provides lecture slides, homework and exam problems, and 
sample projects, as well as pointers to useful material and resources that 
are publicly available and relevant to social media mining. 


To the Instructors 

The book is designed for a one-semester course for senior undergraduate
or graduate students. Though it is mainly written for students with a
background in computer science, readers with a basic understanding of
probability, statistics, and linear algebra will find it easily accessible. Some
chapters can be skipped or assigned as homework for review
if students already know the material in a chapter. For example, if
students have taken a data mining or machine learning course, they can
skip Chapter 5. When time is limited, Chapters 6-8 should be discussed in
depth, and Chapters 9 and 10 can be either discussed briefly or assigned as
part of reading materials for course projects.


Reza Zafarani 
Mohammad Ali Abbasi 
Huan Liu 
Tempe, AZ 
August, 2013 
Introduction


With the rise of social media, the web has become a vibrant and lively realm 
in which billions of individuals all around the globe interact, share, post, 
and conduct numerous daily activities. Information is collected, curated, and 
published by citizen journalists and simultaneously shared or consumed by 
thousands of individuals, who give spontaneous feedback. Social media 
enables us to be connected and interact with each other anywhere and
anytime - allowing us to observe human behavior at an unprecedented scale
with a new lens. This social media lens provides us with golden
opportunities to understand individuals at scale and to mine human behavioral
patterns otherwise impossible. As a byproduct, by understanding
individuals better, we can design better computing systems tailored to
individuals’ needs that will serve them and society better. This new social media
world has no geographical boundaries and incessantly churns out oceans 
of data. As a result, we are facing an exacerbated problem of big data - 
“drowning in data, but thirsty for knowledge.” Can data mining come to the 
rescue? 

Unfortunately, social media data is significantly different from the
traditional data that we are familiar with in data mining. Apart from enormous
size, the mainly user-generated data is noisy and unstructured, with
abundant social relations such as friendships and followers-followees. This new
type of data mandates new computational data analysis approaches that can
combine social theories with statistical and data mining methods. The
pressing demand for new techniques ushers in and entails a new interdisciplinary
field - social media mining. 


1.1 What is Social Media Mining 

Social media shatters the boundaries between the real world and the virtual 
world. We can now integrate social theories with computational methods 
to study how individuals (also known as social atoms) interact and how 
communities (i.e., social molecules) form. The uniqueness of social media
data calls for novel data mining techniques that can effectively handle
user-generated content with rich social relations. The study and development
of these new techniques are under the purview of social media mining, 
an emerging discipline under the umbrella of data mining. Social Media 
Mining is the process of representing, analyzing, and extracting actionable 
patterns from social media data. 

Social Media Mining introduces basic concepts and principal algorithms
suitable for investigating massive social media data; it discusses theories
and methodologies from different disciplines such as computer science, data
mining, machine learning, social network analysis, network science,
sociology, ethnography, statistics, optimization, and mathematics. It encompasses
the tools to formally represent, measure, model, and mine meaningful
patterns from large-scale social media data.

Social media mining cultivates a new kind of data scientist who is 
well versed in social and computational theories, specialized to analyze 
recalcitrant social media data, and skilled to help bridge the gap from what 
we know (social and computational theories) to what we want to know about 
the vast social media world with computational tools. 


1.2 New Challenges for Mining 

Social media mining is an emerging field where there are more problems 
than ready solutions. Equipped with interdisciplinary concepts and
theories, fundamental principles, and state-of-the-art algorithms, we can stand
on the shoulders of giants and embark on solving challenging problems
and developing novel data mining techniques and scalable computational
algorithms. In general, social media can be considered a world of social
atoms (i.e., individuals), entities (e.g., content, sites, and networks), and
interactions between individuals and entities. Social theories and social
norms govern the interactions between individuals and entities. For
effective social media mining, we collect information about individuals and
entities, measure their interactions, and discover patterns to understand human
behavior. 

Mining social media data is the task of mining user-generated content 
with social relations. This data presents novel challenges encountered in
social media mining. 

Big Data Paradox. Social media data is undoubtedly big. However, when
we zoom into individuals for whom, for example, we would like to make
relevant recommendations, we often have little data for each specific
individual. We have to exploit the characteristics of social media and use its
multidimensional, multisource, and multisite data to aggregate information
with sufficient statistics for effective mining.

Obtaining Sufficient Samples. One of the commonly used methods to 
collect data is via application programming interfaces (APIs) from social 
media sites. Only a limited amount of data can be obtained daily. Without 
knowing the population’s distribution, how can we know that our samples 
are reliable representatives of the full data? Consequently, how can we 
ensure that our findings obtained from social media mining are any
indication of true patterns that can benefit our research or business development?

Noise Removal Fallacy. In classic data mining literature, a successful data 
mining exercise entails extensive data preprocessing and noise removal as 
“garbage in and garbage out.” By its nature, social media data can contain 
a large portion of noisy data. We have observed two important principles: 

(1) blindly removing noise can worsen the problem stated in the big data 
paradox because the removal can also eliminate valuable information, and 

(2) the definition of noise becomes complicated and relative because it is 
dependent on our task at hand. 

Evaluation Dilemma. A standard procedure of evaluating patterns in data 
mining is to have some kind of ground truth. For example, a dataset can be 
divided into training and test sets. Only the training data is used in learning, 
and the test data serves as ground truth for testing. However, ground truth is 
often not available in social media mining. Evaluating patterns from social 
media mining poses a seemingly insurmountable challenge. On the other 
hand, without credible evaluation, how can we guarantee the validity of the 
patterns? 
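
The training/test procedure mentioned above can be sketched in a few lines. This is only a minimal illustration, assuming a small hypothetical list of labeled examples; the function name `holdout_split`, the 80/20 ratio, and the fixed seed are choices made for this sketch and are not prescribed by the text.

```python
import random


def holdout_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and split it into (training, test) lists.

    `test_fraction` and `seed` are illustrative defaults for this sketch.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled data: (feature, label) pairs whose labels are known.
data = [(x, x % 2) for x in range(10)]
train, test = holdout_split(data)

# Only `train` would be used for learning; the labels in `test` play the
# role of ground truth when predictions are checked.
print(len(train), len(test))
```

The difficulty in social media mining is that labels such as those in `data` are often unavailable in the first place, so a split of this kind cannot be applied directly.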

This book contains basic concepts and fundamental principles that will 
help readers contemplate and design solutions to address these challenges 
intrinsic to social media mining. 


1.3 Book Overview and Reader’s Guide 

This book consists of three parts. Part I, Essentials, outlines ways to
represent social media data and provides an understanding of fundamental
elements of social media mining. Part II, Communities and Interactions,
discusses how communities can be found in social media and how interactions
occur and information propagates in social media. Part III, Applications,
offers some novel illustrative applications of social media mining.
Throughout the book, we use examples to explain how things work and 
to deepen the understanding of abstract concepts and profound algorithms. 
These examples show in a tangible way how theories are applied or ideas 
are materialized in discovering meaningful patterns in social media data. 

Consider an online social networking site with millions of members 
in which members have the opportunity to befriend one another, send 
messages to each other, and post content on the site. Facebook, LinkedIn,
and Twitter are exemplars of such sites. To make sense of data from these 
sites, we resort to social media mining to answer corresponding questions. 
In Part I: Essentials (Chapters 2-5), we learn to answer questions such as 
the following: 

1. Who are the most important people in a social network? 

2. How do people befriend others? 

3. How can we find interesting patterns in user-generated content? 

These essentials come into play in Part II: Communities and Interactions 
(Chapters 6 and 7) where we attempt to analyze how communities are 
formed, how they evolve, and how the quality of detected communities
is evaluated. We show ways in which information diffusion in social media
can be studied. We aim to answer general questions such as the following: 

1. How can we identify communities in a social network? 

2. When someone posts an interesting article on a social network, how 
far can the article be transmitted in that network? 

In Part III: Applications (Chapters 8-10), we exemplify social media
mining using real-world problems in dealing with social media:
measuring influence, recommending in a social environment, and analyzing user
behavior. We aim to answer these questions:

1. How can we measure the influence of individuals in a social network? 

2. How can we recommend content or friends to individuals online? 

3. How can we analyze the behavior of individuals online? 

To provide an overall picture of the chapter content, we created a
dependency graph among chapters (Fig. 1.1) in which arrows suggest
dependencies between chapters. Based on the dependency graph, therefore, a reader
can start with Chapter 2 (graph essentials), and it is recommended that
he or she read Chapters 5 (data mining essentials) and 8 (influence and
homophily) before Chapter 9 (recommendation in social media). We have
also color-coded chapter boxes that are of the same level of importance and
abstraction. The darkest chapters are the essentials of this book, and the
lightest boxes are those chapters that are more applied and have materials
that are built on the foundation of other chapters.

Figure 1.1. Dependency between Book Chapters. Arrows show dependencies and colors
represent book parts.


Who Should Read This Book? 

A reader with a basic computer science background and knowledge of 
data structures, search, and graph algorithms will find this book easily 
accessible. Limited knowledge of linear algebra, calculus, probability, and 
statistics will help readers understand technical details with ease. Having a 
data mining or machine learning background is a plus, but not necessary. 

The book is designed for senior undergraduate and graduate students. It 
is organized in such a way that it can be taught in one semester to students 
with a basic prior knowledge of statistics and linear algebra. It can also be 
used for a graduate seminar course by focusing on more advanced chapters 
with the supplement of detailed bibliographical notes. Moreover, the book 
can be used as a reference book for researchers, practitioners, and project 
managers of related fields who are interested in both learning the basics and 
tangible examples of this emerging field and understanding the potentials
and opportunities that social media mining can offer. 

1.4 Summary 

As defined by Kaplan and Haenlein [2010], social media is the “group
of internet-based applications that build on the ideological and
technological foundations of Web 2.0, and that allow the creation and exchange
of user-generated content.” There are many categories of social media
including, but not limited to, social networking (Facebook or LinkedIn),
microblogging (Twitter), photo sharing (Flickr, Photobucket, or Picasa),
news aggregation (Google Reader, StumbleUpon, or Feedburner), video
sharing (YouTube, MetaCafe), livecasting (Ustream or Justin.TV), virtual
worlds (Kaneva), social gaming (World of Warcraft), social search (Google,
Bing, or Ask.com), and instant messaging (Google Talk, Skype, or Yahoo!
Messenger).

The first social media site was introduced by Geocities in 1994, which 
allowed users to create their own homepages. The first social networking 
site, SixDegrees.com, was introduced in 1997. Since then, many other social
media sites have been introduced, each providing service to millions of 
people. These individuals form a virtual world in which individuals (social 
atoms), entities (content, sites, etc.) and interactions (between individuals, 
between entities, between individuals and entities) coexist. Social norms 
and human behavior govern this virtual world. By understanding these 
social norms and models of human behavior and combining them with the 
observations and measurements of this virtual world, one can systematically 
analyze and mine social media. 

Social media mining is the process of representing, analyzing, and 
extracting meaningful patterns from data in social media, resulting from 
social interactions. It is an interdisciplinary field encompassing techniques
from computer science, data mining, machine learning, social network
analysis, network science, sociology, ethnography, statistics, optimization, and
mathematics. Social media mining faces grand challenges such as the big
data paradox, obtaining sufficient samples, the noise removal fallacy, and
the evaluation dilemma.

Social media mining represents the virtual world of social media in a
computable way, measures it, and designs models that can help us
understand its interactions. In addition, social media mining provides the
necessary tools to mine this world for interesting patterns, analyze information
diffusion, study influence and homophily, provide effective
recommendations, and analyze novel social behavior in social media.

1.5 Bibliographic Notes 

For historical notes on social media sites and challenges in social media,
refer to [Ellison et al., 2007; Lietsala and Sirkkunen, 2008; Kaplan and
Haenlein, 2010; Kleinberg, 1998; Gundecha and Liu, 2012]. Kaplan and
Haenlein [2010] provide a categorization of social media sites into
collaborative projects, blogs, content communities, social networking sites, virtual
game worlds, and virtual social worlds. Our definition of social media is a
rather abstract one whose elements are social atoms (individuals), entities,
and interactions. A more detailed abstraction can be found in the work of
Kietzmann et al. [2011]. They consider the seven building blocks of social
media to be identity, conversation, sharing, presence, relationships,
reputation, and groups. They argue that the amount of attention that sites give to
these building blocks makes them different in nature. For instance, YouTube
provides more functionality in terms of groups than LinkedIn.

Social media mining brings together techniques from many disciplines.
General references that can accompany this book and help readers better
understand its material can be found in data mining and web
mining [Han et al., 2006; Tang, Wang, and Liu, 2012; Friedman, Hastie,
and Tibshirani, 2009; Liu, 2007; Chakrabarti, 2003], machine learning
[Bishop, 2006], and pattern recognition [Duda, Hart, and Stork, 2012]
texts, as well as network science and social network analysis [Easley and
Kleinberg, 2010; Scott, 1988; Newman, 2010; Kadushin, 2012; Barrat,
Barthelemy, and Vespignani, 2008] textbooks. For relevant references on
optimization, refer to [Boyd and Vandenberghe, 2004; Nocedal and Wright,
2006; Papadimitriou and Steiglitz, 1998; Nemhauser and Wolsey, 1988]
and for algorithms to [Leiserson et al., 2001; Kleinberg and Tardos, 2005].
For general references on social research methods, consult [Bernard and
Bernard, 2012; Bryman, 2012]. Note that these are generic references and
more specific references are provided at the end of each chapter. This
book discusses non-multimedia data in social media. For multimedia data
analysis, refer to [Candan and Sapino, 2010].

Recent developments in social media mining can be found in
journal articles in IEEE Transactions on Knowledge and Data Engineering
(TKDE), ACM Transactions on Knowledge Discovery from Data (TKDD),
ACM Transactions on Intelligent Systems and Technology (TIST), Social
Network Analysis and Mining (SNAM), Knowledge and Information
Systems (KAIS), ACM Transactions on the Web (TWEB), Data Mining and
Knowledge Discovery (DMKD), World Wide Web Journal, Social
Networks, Internet Mathematics, IEEE Intelligent Systems, and SIGKDD
Explorations. Conference papers can be found in proceedings of Knowledge
Discovery and Data Mining (KDD), World Wide Web (WWW),
Association for Computational Linguistics (ACL), Conference on Information
and Knowledge Management (CIKM), International Conference on Data
Mining (ICDM), Internet Measurement Conference (IMC), International
Conference on Weblogs and Social Media (ICWSM), International Conference
on Web Engineering (ICWE), Pacific-Asia Conference on Knowledge
Discovery and Data Mining (PAKDD), The European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in
Databases (ECML/PKDD), Web Search and Data Mining (WSDM),
International Joint Conferences on Artificial Intelligence (IJCAI), Association for
the Advancement of Artificial Intelligence (AAAI), Recommender Systems
(RecSys), Computer-Human Interaction (CHI), SIAM International
Conference on Data Mining (SDM), Hypertext (HT), and Social Computing
Behavioral-Cultural Modeling and Prediction (SBP) conferences.


1.6 Exercises 

1. Discuss some methodologies that can address the grand challenges of 
social media. 

2. What are the key characteristics of social media that differentiate it 
from other media? Please list at least two with a brief explanation. 

3. What are the different types of social media? Name two, and provide a 
definition and an example for each type. 

4. (a) Visit the websites in Table 1.1 (or find similar ones) and identify
the types of activities that individuals can perform on each one.

(b) Similar to questions posed in Section 1.1, design two questions
that you find interesting to ask with respect to each site.


Table 1.1. List of Websites

Amazon        Flickr     Facebook   Twitter
BlogCatalog   MySpace    Last.fm    Pandora
LinkedIn      Reddit     Vimeo      Del.icio.us
StumbleUpon   Yelp       YouTube    Meetup

5. What marketing opportunities do you think exist in social media? Can
you outline an example of such an opportunity in Twitter? 

6. How does behavior of individuals change across sites? What behaviors 
remain consistent and what behaviors likely change? What are possible 
reasons behind these differences? 

7. How does social media influence real-world behaviors of individuals? 
Identify a behavior that is due to the usage of, say, Twitter. 

8. Outline how social media can help NGOs fulfill their missions better 
in performing tasks such as humanitarian assistance and disaster relief. 

9. Identify at least three major side effects of information sharing on social 
media. 

10. Rumors spread rapidly on social media. Can you think of some method 
to block the spread of rumors on social media? 




Part I 

Essentials 




2 

Graph Essentials 


We live in a connected world in which networks are intertwined with our
daily life. Networks of air and land transportation help us reach our
destinations; critical infrastructure networks that distribute water and electricity are
essential for our society and economy to function; and networks of
communication help disseminate information at an unprecedented rate. Finally, our
social interactions form social networks of friends, family, and colleagues. 
Social media attests to the growing body of these social networks in which 
individuals interact with one another through friendships, email, blogposts, 
buying similar products, and many other mechanisms. 

Social media mining aims to make sense of these individuals embedded 
in networks. These connected networks can be conveniently represented 
using graphs. As an example, consider a set of individuals on a social 
networking site where we want to find the most influential individual. Each 
individual can be represented using a node (circle) and two individuals who 
know each other can be connected with an edge (line). In Figure 2.1, we 
show a set of seven individuals and their friendships. Consider a hypothetical 
social theory that states that “the more individuals you know, the more 
influential you are.” This theory in our graph translates to the individual with 
the maximum degree (the number of edges connected to its corresponding 
node) being the most influential person. Therefore, in this network Juan is 
the most influential individual because he knows four others, which is more 
than anyone else. This simple scenario is an instance of many problems 
that arise in social media, which can be solved by modeling the problem 
as a graph. This chapter formally details the essentials required for using 
graphs and the fundamental algorithms required to explore these graphs in 
social media mining. 
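
The "maximum degree" reading of influence can be sketched as follows. The friendship list is a hypothetical toy example and does not reproduce Figure 2.1; the helper names `degrees` and `most_influential` are likewise assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical undirected friendship list; it does not reproduce Figure 2.1.
friendships = [
    ("Ann", "Bob"),
    ("Ann", "Cat"),
    ("Ann", "Dan"),
    ("Bob", "Cat"),
    ("Dan", "Eve"),
]


def degrees(edges):
    """Return a dict mapping each node to its number of incident edges."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def most_influential(edges):
    """Return a node with maximum degree (ties go to the node seen first)."""
    degree = degrees(edges)
    return max(degree, key=degree.get)


print(most_influential(friendships))
```

Under the hypothetical theory, the node returned is the most influential one; a different social theory would call for a different measure computed on the same graph.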



Figure 2.1. A Sample Graph. In this graph, individuals are represented with nodes 
(circles), and individuals who know each other are connected with edges (lines). 

2.1 Graph Basics 

In this section, we review some of the common notation used in graphs. 
Any graph contains both a set of objects, called nodes, and the connections 
between these nodes, called edges. Mathematically, a graph G is denoted
as a pair G(V, E), where V represents the set of nodes and E represents the
set of edges. We formally define nodes and edges next.

2.1.1 Nodes 

All graphs have fundamental building blocks. One major part of any graph 
is the set of nodes. In a graph representing friendship, these nodes represent 
people, and any pair of connected people denotes the friendship between 
them. Depending on the context, these nodes are called vertices or actors. 
For example, in a web graph, nodes represent websites, and the connections 
between nodes indicate web-links between them. In a social setting, these 
nodes are called actors. The mathematical representation for a set of nodes 
is 


$$V = \{v_1, v_2, \ldots, v_n\}, \tag{2.1}$$

where V is the set of nodes and $v_i$, $1 \le i \le n$, is a single node. $|V| = n$ is
called the size of the graph. In Figure 2.1, n = 7.


2.1.2 Edges 

Another important element of any graph is the set of edges. Edges connect
nodes. In a social setting, where nodes represent social entities such as
people, edges indicate inter-node relationships and are therefore known as
relationships or (social) ties. The edge set is usually represented using E,

$$E = \{e_1, e_2, \ldots, e_m\}, \tag{2.2}$$

where $e_i$, $1 \le i \le m$, is an edge and the size of the set is commonly shown
as $m = |E|$. In Figure 2.1, lines connecting nodes represent the edges, so
in this case, m = 8. Edges are also represented by their endpoints (nodes),
so $e(v_1, v_2)$ (or $(v_1, v_2)$) defines an edge e between nodes $v_1$ and $v_2$. Edges
can have directions, meaning one node is connected to another, but not vice
versa. When edges are undirected, nodes are connected both ways. Note that
in Figure 2.2(b), edges $e(v_1, v_2)$ and $e(v_2, v_1)$ are the same edges, because
there is no direction in how nodes get connected. We call edges in this
graph undirected edges and this kind of graph an undirected graph.
Conversely, when edges have directions, $e(v_1, v_2)$ is not the same as $e(v_2, v_1)$.
Graph 2.2(a) is a graph with directed edges; it is an example of a directed
graph. Directed edges are represented using arrows. In a directed graph,
an edge $e(v_i, v_j)$ is represented using an arrow that starts at $v_i$ and ends
at $v_j$. Edges can start and end at the same node; these edges are called
loops or self-links and are represented as $e(v_i, v_i)$. For any node $v_i$, in an
undirected graph the set of nodes it is connected to via an edge is called its
neighborhood and is represented as $N(v_i)$. In Figure 2.1, N(Jade) = {Jeff,
Juan}. In directed graphs, node $v_i$ has incoming neighbors $N_{in}(v_i)$ (nodes
that connect to $v_i$) and outgoing neighbors $N_{out}(v_i)$ (nodes that $v_i$ connects
to). In Figure 2.2(a), $N_{in}(v_2) = \{v_3\}$ and $N_{out}(v_2) = \{v_1, v_3\}$.

(a) Directed Graph (b) Undirected Graph

Figure 2.2. A Directed Graph and an Undirected Graph. Circles represent nodes, and
lines or arrows connecting nodes are edges.
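
The neighborhood definitions translate directly into set operations on an edge list. The sketch below is illustrative only: the three directed edges are chosen to be consistent with the neighborhoods of v2 stated in the text, the actual figure may contain additional edges, and the function names are assumptions made here.

```python
def neighborhood(edges, node):
    """Neighbors of `node` when `edges` are read as undirected pairs."""
    result = set()
    for u, v in edges:
        if u == node:
            result.add(v)
        if v == node:
            result.add(u)
    return result


def in_neighbors(edges, node):
    """Nodes u with a directed edge (u, node)."""
    return {u for u, v in edges if v == node}


def out_neighbors(edges, node):
    """Nodes v with a directed edge (node, v)."""
    return {v for u, v in edges if u == node}


# Directed edges consistent with the neighborhoods of v2 stated in the text;
# the actual figure may contain additional edges (assumption).
directed_edges = [("v2", "v1"), ("v2", "v3"), ("v3", "v2")]

print(in_neighbors(directed_edges, "v2"))
print(out_neighbors(directed_edges, "v2"))
print(neighborhood(directed_edges, "v2"))
```

Reading the same pairs as undirected merges the incoming and outgoing neighbors into a single set. With these definitions, a loop on a node places that node in its own neighborhood.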




2.1.3 Degree and Degree Distribution 

The number of edges connected to one node is the degree of that node.
The degree of a node $v_i$ is often denoted using $d_i$. In the case of directed edges,
nodes have in-degrees (edges pointing toward the node) and out-degrees
(edges pointing away from the node). These values are presented using $d_i^{in}$
and $d_i^{out}$, respectively. In social media, degree represents the number of
friends a given user has. For example, on Facebook, degree represents the
user’s number of friends, and on Twitter in-degree and out-degree represent
the number of followers and followees, respectively. In any undirected
graph, the summation of all node degrees is equal to twice the number of
edges.

Theorem 2.1. The summation of degrees in an undirected graph is twice
the number of edges,

$\sum_i d_i = 2|E|$. (2.3)

Proof. Any edge has two endpoints; therefore, when calculating the degrees
$d_i$ and $d_j$ for any connected nodes $v_i$ and $v_j$, the edge between
them contributes 1 to both $d_i$ and $d_j$; hence, if the edge is removed,
$d_i$ and $d_j$ become $d_i - 1$ and $d_j - 1$, and the summation
$\sum_k d_k$ becomes $\sum_k d_k - 2$. Hence, by removal of all $m$ edges,
the degree summation becomes smaller by $2m$. However, we know that when all
edges are removed the degree summation becomes zero; therefore, the degree
summation is $2 \times m = 2|E|$. □

Lemma 2.1. In an undirected graph, there are an even number of nodes 
having odd degree. 

Proof. The result can be derived from the previous theorem directly because
the summation of degrees is even: $2|E|$. Therefore, when the degrees of
nodes with even degree are removed from this summation, the sum of the
remaining (odd) degrees must also be even; hence, there must exist an even
number of nodes with odd degree. □

Lemma 2.2. In any directed graph, the summation of in-degrees is equal
to the summation of out-degrees,

$\sum_i d_i^{out} = \sum_j d_j^{in}$. (2.4)

Proof. The proof is left as an exercise. □ 
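The degree identities above are easy to check numerically. The sketch below
is only a sanity check on one small hypothetical graph, not a proof; the
helper names `degrees` and `in_out_degrees` and the four-node edge list are
assumptions made for illustration.

```python
def degrees(nodes, edges):
    """Degree of every node in an undirected graph given as an edge list."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1  # a loop (u == v) adds 2 to the same node
    return deg


def in_out_degrees(nodes, edges):
    """In- and out-degrees when each pair (u, v) is read as a directed edge."""
    d_in = {v: 0 for v in nodes}
    d_out = {v: 0 for v in nodes}
    for u, v in edges:
        d_out[u] += 1
        d_in[v] += 1
    return d_in, d_out


# Hypothetical graph used only for illustration.
nodes = ["v1", "v2", "v3", "v4"]
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v1"), ("v3", "v4")]

deg = degrees(nodes, edges)
odd_nodes = [v for v, d in deg.items() if d % 2 == 1]
d_in, d_out = in_out_degrees(nodes, edges)

print(sum(deg.values()) == 2 * len(edges))         # Theorem 2.1
print(len(odd_nodes) % 2 == 0)                     # Lemma 2.1
print(sum(d_in.values()) == sum(d_out.values()))   # Lemma 2.2
```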


Degree Distribution 

In very large graphs, the distribution of node degrees (degree distribution)
is an important attribute to consider. The degree distribution plays an
important role in describing the network being studied. Any distribution
can be described by its members. In our case, these are the degrees of all
nodes in the graph. The degree distribution $p_d$ (or $P(d)$, or
$P(d_v = d)$) gives the probability that a randomly selected node $v$ has
degree $d$. Because $p_d$ is a probability distribution,
$\sum_{d=0}^{\infty} p_d = 1$. In a graph with $n$ nodes, $p_d$ is defined as

$p_d = \frac{n_d}{n}$, (2.5)

where $n_d$ is the number of nodes with degree $d$. An important, commonly
performed procedure is to plot a histogram of the degree distribution, in
which the x-axis represents the degree ($d$) and the y-axis represents
either (1) the number of nodes having that degree ($n_d$) or (2) the
fraction of nodes having that degree ($p_d$).


Example 2.1. For the graph provided in Figure 2.1, the degree distribution
$p_d$ for $d \in \{1, 2, 3, 4\}$ is

$p_1 = \frac{1}{7}$, $p_2 = \frac{4}{7}$, $p_3 = \frac{1}{7}$, $p_4 = \frac{1}{7}$, (2.6)

because four of the seven nodes have degree 2, and degrees 1, 3, and 4 are
each observed once.
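The computation in Example 2.1 can be sketched in a few lines. The degree
sequence below is the one described in the example (one node each of degree
1, 3, and 4, and four nodes of degree 2); the function name
`degree_distribution` is a hypothetical choice, and exact fractions are used
so that the values match Equation 2.6 without rounding.

```python
from collections import Counter
from fractions import Fraction


def degree_distribution(degree_sequence):
    """Map each observed degree d to p_d = n_d / n."""
    n = len(degree_sequence)
    counts = Counter(degree_sequence)  # n_d for every observed degree d
    return {d: Fraction(c, n) for d, c in sorted(counts.items())}


degree_sequence = [1, 2, 2, 2, 2, 3, 4]  # degrees described in Example 2.1
p = degree_distribution(degree_sequence)
for d, prob in p.items():
    print(d, prob)  # prints 1 1/7, 2 4/7, 3 1/7, 4 1/7 on separate lines
```

Plotting `p` with the degree on the x-axis gives the histogram described
above.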


Example 2.2. In social networking sites, friendship relationships can be
represented by a large graph. In this graph, nodes represent individuals and
the edges represent friendship relationships. We can compute the degrees
and plot the degree distribution using a graph where the x-axis is the
degree and the y-axis is the fraction of nodes with that degree. The degree
distribution plot for Facebook in May 2012 is shown in Figure 2.3. A
general trend observable in social networking sites is that there exist many
users with few connections and a handful of users with very large numbers
of friends. This is commonly called the power-law degree distribution.

As previously discussed, any graph $G$ can be represented as a pair
$G(V, E)$, where $V$ is the node set and $E$ is the edge set. Since edges
are between nodes, we have

$E \subseteq V \times V$. (2.7)



Figure 2.3. Degree Distribution for both the global and U.S. population of Facebook
users (from Ugander et al. [2011]). There exist many users with few friends and a few
users with many friends. This is due to a power-law degree distribution.

Graphs can also have subgraphs. For any graph $G(V, E)$, a graph
$G'(V', E')$ is a subgraph of $G(V, E)$ if the following properties hold:

$V' \subseteq V$, (2.8)

$E' \subseteq (V' \times V') \cap E$. (2.9)

2.2 Graph Representation 

We have demonstrated the visual representation of graphs. This
representation, although clear to humans, cannot be used effectively by
computers or manipulated using mathematical tools. We therefore seek
representations that can store the node and edge sets in a way that (1) does
not lose information, (2) can be manipulated easily by computers, and
(3) can have mathematical methods applied easily.


Adjacency Matrix 

A simple way of representing graphs is to use an adjacency matrix (also
known as a sociomatrix). Figure 2.4 depicts an example of a graph and
its corresponding adjacency matrix. A value of 1 in the adjacency matrix
indicates a connection between nodes $v_i$ and $v_j$, and a 0 denotes no
connection between the two nodes. When generalized, any real number can be
used to show the strength of connections between two nodes. The adjacency
matrix gives a natural mathematical representation for graphs. Note that in
social networks, because of the relatively small number of interactions,
many cells remain zero. This creates a large sparse matrix. In numerical
analysis, a sparse matrix is one that is populated primarily with zeros.

        v1   v2   v3   v4   v5   v6
  v1     0    1    0    0    0    0
  v2     1    0    1    1    0    0
  v3     0    1    0    1    0    0
  v4     0    1    1    0    1    1
  v5     0    0    0    1    0    0
  v6     0    0    0    1    0    0

(b) Adjacency Matrix

Figure 2.4. A Graph and Its Corresponding Adjacency Matrix.

In the adjacency matrix, diagonal entries represent self-links or loops.
Adjacency matrices can be commonly formalized as

$A_{i,j} = \begin{cases} 1 & \text{if } v_i \text{ is connected to } v_j, \\ 0 & \text{otherwise.} \end{cases}$ (2.10)
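As a rough illustration of this definition, the matrix of Figure 2.4 can be
built from the pairs of connected nodes that it encodes. This is a sketch
under stated assumptions: nodes are numbered 1 to 6 for $v_1, \ldots, v_6$,
the matrix is a plain list of lists, and the function name
`adjacency_matrix` is hypothetical.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from 1-based node pairs."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] = 1
        if not directed:
            A[j - 1][i - 1] = 1  # undirected edges are stored symmetrically
    return A


# Connected pairs read off the adjacency matrix in Figure 2.4.
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6)]
A = adjacency_matrix(6, edges)

print(A[3])  # row of v4: [0, 1, 1, 0, 1, 1]

# Only 12 of the 36 cells are nonzero, which illustrates sparsity.
nonzero = sum(sum(row) for row in A)
```

Because the graph is undirected, each row sum equals the degree of the
corresponding node.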


Adjacency List 

In an adjacency list, every node is linked with a list of all the nodes that are 
connected to it. The list is often sorted based on node order or some other 
preference. For the graph shown in Figure 2.4, the corresponding adjacency 
list is shown in Table 2.1 . 


Table 2.1. Adjacency List

  Node    Connected To
  v1      v2
  v2      v1, v3, v4
  v3      v2, v4
  v4      v2, v3, v5, v6
  v5      v4
  v6      v4

Edge List

Another simple and common approach to storing large graphs is to save all
edges in the graph. This is known as the edge list representation. For the
graph shown in Figure 2.4, we have the following edge list representation:

  (v1, v2)
  (v2, v3)
  (v2, v4)
  (v3, v4)
  (v4, v5)
  (v4, v6)

In this representation, each element is an edge and is represented as
$(v_i, v_j)$, denoting that node $v_i$ is connected to node $v_j$. Since
social media networks are sparse, both the adjacency list and edge list
representations save significant space. This is because many zeros need to
be stored when using adjacency matrices, but do not need to be stored in an
adjacency or an edge list.
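To make the space argument concrete, the sketch below converts the edge
list of the Figure 2.4 graph into an adjacency list and compares the number
of stored entries with the number of cells in the adjacency matrix. The
function name `adjacency_list` and the use of integers 1 to 6 for
$v_1, \ldots, v_6$ are assumptions made for illustration.

```python
def adjacency_list(edges):
    """Build an undirected adjacency list {node: sorted neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return {u: sorted(vs) for u, vs in sorted(adj.items())}


edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6)]
adj = adjacency_list(edges)
print(adj[4])  # [2, 3, 5, 6]

n = len(adj)
matrix_cells = n * n                                 # 36 cells, mostly zeros
list_entries = sum(len(vs) for vs in adj.values())   # 12 entries (2 per edge)
edge_entries = len(edges)                            # 6 entries
```

The gap widens quickly for sparse graphs: the matrix grows with $n^2$,
whereas the two list representations grow with the number of edges.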


2.3 Types of Graphs 

Graphs come in many types. In this section, we discuss several basic ones.

Null Graph. A null graph is a graph where the node set is empty (there
are no nodes). Since there are no nodes, there are also no edges.
Formally,

$G(V, E)$, $V = E = \emptyset$. (2.11)

Empty Graph. An empty or edgeless graph is one where the edge set is
empty:

$G(V, E)$, $E = \emptyset$. (2.12)

Note that the node set can be non-empty. A null graph is an empty graph
but not vice versa.

Directed/Undirected/Mixed Graphs. Graphs that we have discussed thus
far rarely had directed edges. As mentioned, graphs that only have directed
edges are called directed graphs and ones that only have undirected ones are
called undirected graphs. Mixed graphs have both directed and undirected
edges. In directed graphs, we can have two edges between $i$ and $j$ (one
from $i$ to $j$ and one from $j$ to $i$), whereas in undirected graphs only
one edge can exist. As a result, the adjacency matrix for directed graphs is
not in general symmetric ($i$ connected to $j$ does not mean $j$ is
connected to $i$, i.e., $A_{i,j} \neq A_{j,i}$), whereas the adjacency matrix
for undirected graphs is symmetric ($A = A^T$).

In social media, there are many directed and undirected networks. For 
instance, Facebook is an undirected network in which if Jon is a friend 
of Mary, then Mary is also a friend of Jon. Twitter is a directed network, 
where follower relationships are not bidirectional. One direction is called 
followers, and the other is denoted as following. 

Simple Graphs and Multigraphs. In the example graphs that we have 
provided thus far, only one edge could exist between any pair of nodes. 
These graphs are denoted as simple graphs. Multigraphs are graphs where 
multiple edges between two nodes can exist. The adjacency matrix for 
multigraphs can include numbers larger than one, indicating multiple edges 
between nodes. Multigraphs are frequently observed in social media where
two individuals can have different interactions with one another. They can
be friends and, at the same time, colleagues, group members, or related in
some other way. For each one of these relationships, a new edge can be added
between the individuals, creating a multigraph.

Weighted Graphs. A weighted graph is one in which edges are associated
with weights. For example, a graph could represent a map, where nodes
are cities and edges are routes between them. The weight associated with
each edge represents the distance between these cities. Formally, a weighted
graph can be represented as $G(V, E, W)$, where $W$ represents the weights
associated with each edge, $|W| = |E|$. For an adjacency matrix
representation, instead of 1/0 values, we can use the weight associated with
the edge. This saves space by combining $E$ and $W$ into one adjacency
matrix $A$, assuming that an edge exists between $v_i$ and $v_j$ if and only
if $W_{ij} \neq 0$. Depending on the context, this weight can also be
represented by $w_{ij}$ or $w(i, j)$.

An example of a weighted graph is the web graph. A web graph is a 
way of representing how internet sites are connected on the web. In general, 
a web graph is a directed graph. Nodes represent sites and edge weights 
represent number of links between sites. Two sites can have multiple links 
pointing to each other, and individual sites can have loops (links pointing 
to themselves).

A special case of a weighted graph is when we have binary weights
(0/1 or +/-) on edges. These edges are commonly called signed edges,
and the weighted graph is called a signed graph. Signed edges can be
employed to represent contradictory behavior. For instance, one can use
signed edges to represent friends and foes. A positive (+) edge between
two nodes represents friendship, and a negative (-) edge denotes that
the endpoint nodes (individuals) are considered enemies. When edges are
directed, one endpoint considers the other endpoint a friend or a foe, but
not vice versa. When edges are undirected, endpoints are mutually friends
or foes. In another setting, a + edge can denote a higher social status,
and a - edge can represent lower social status. Social status is the rank
assigned to one's position in society. For instance, a school principal can
be connected via a directed + edge to a student of the school because, in
the school environment, the principal is considered to be of higher status.
Figure 2.5 shows a signed graph consisting of nodes and the signed edges
between them.

Figure 2.5. A Signed Graph Example.


2.4 Connectivity in Graphs 

Connectivity defines how nodes are connected via a sequence of edges in 
a graph. Before we define connectivity, some preliminary concepts need to 
be detailed. 

Adjacent Nodes and Incident Edges. Two nodes $v_1$ and $v_2$ in graph
$G(V, E)$ are adjacent when $v_1$ and $v_2$ are connected via an edge:

$v_1$ is adjacent to $v_2$ $\equiv$ $e(v_1, v_2) \in E$. (2.13)

Two edges $e_1(a, b)$ and $e_2(c, d)$ are incident when they share one
endpoint (i.e., are connected via a node):

$e_1(a, b)$ is incident to $e_2(c, d)$
$\equiv (a = c) \lor (a = d) \lor (b = c) \lor (b = d)$. (2.14)

Figure 2.6 depicts adjacent nodes and incident edges in a sample graph.
In a directed graph, two edges are incident if the ending of one is the
beginning of the other; that is, the edge directions must match for edges to
be incident.

Figure 2.6. Adjacent Nodes and Incident Edges. In this graph $u$ and $v$, as well as $v$ and
$w$, are adjacent nodes, and edges $(u, v)$ and $(v, w)$ are incident edges.

Traversing an Edge. An edge in a graph can be traversed when one starts at 
one of its end-nodes, moves along the edge, and stops at its other end-node. 
So, if an edge e(a, b) connects nodes a and b, then visiting e can start at a 
and end at b. Alternatively, in an undirected graph we can start at b and end 
the visit at a. 


Walk, Path, Trail, Tour, and Cycle. A walk is a sequence of incident
edges traversed one after another. In other words, if in a walk one traverses
edges $e_1(v_1, v_2), e_2(v_2, v_3), e_3(v_3, v_4), \ldots, e_n(v_n, v_{n+1})$,
we have $v_1$ as the walk's starting node and $v_{n+1}$ as the walk's ending
node. When a walk does not end where it started ($v_1 \neq v_{n+1}$), it is
called an open walk. When a walk returns to where it was started
($v_1 = v_{n+1}$), it is called a closed walk. Similarly, a walk can be
denoted as a sequence of nodes, $v_1, v_2, v_3, \ldots, v_n$. In this
representation, the edges that are traversed are $e_1(v_1, v_2)$,
$e_2(v_2, v_3), \ldots, e_{n-1}(v_{n-1}, v_n)$. The length of a walk is the
number of edges traversed during the walk and in our case is $n - 1$.

A trail is a walk where no edge is traversed more than once; therefore,
all walk edges are distinct. A closed trail (one that ends where it started) is
called a tour or circuit.

A walk where nodes and edges are distinct is called a path, and a closed
path is called a cycle. The length of a path or cycle is the number of edges
traversed in the path or cycle. In a directed graph, we have directed paths
because traversal of edges is only allowed in the direction of the edges. In
Figure 2.7, $v_4, v_3, v_6, v_4, v_2$ is a walk; $v_4, v_3$ is a path;
$v_4, v_3, v_6, v_4, v_2$ is a trail; and $v_4, v_3, v_6, v_4$ is both a tour
and a cycle.

A graph has a Hamiltonian cycle if it has a cycle such that all the
nodes in the graph are visited. It has an Eulerian tour if it has a tour in
which every edge of the graph is traversed exactly once. Examples of a
Hamiltonian cycle and an Eulerian tour are provided in Figure 2.8.
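The distinctions among walks, trails, paths, tours, and cycles can be
checked mechanically for a node sequence. The sketch below is an
illustration under stated assumptions: the graph is undirected and simple,
it is given as a dictionary of neighbor sets, the function name `classify`
is hypothetical, and the function reports only the most specific label (a
cycle is also a tour, and a path is also a trail). The example graph is the
one from Figure 2.4, with integers 1 to 6 standing for $v_1, \ldots, v_6$.

```python
def classify(adj, nodes):
    """Classify a node sequence on an undirected simple graph."""
    steps = list(zip(nodes, nodes[1:]))
    if any(v not in adj[u] for u, v in steps):
        return "not a walk"  # some consecutive pair is not an edge
    closed = nodes[0] == nodes[-1]
    edges = [frozenset(step) for step in steps]
    if len(set(edges)) < len(edges):
        return "closed walk" if closed else "open walk"
    # All traversed edges are distinct, so this is at least a trail.
    inner = nodes[:-1] if closed else nodes
    if len(set(inner)) == len(inner):
        return "cycle" if closed else "path"
    return "tour" if closed else "trail"


adj = {1: {2}, 2: {1, 3, 4}, 3: {2, 4}, 4: {2, 3, 5, 6}, 5: {4}, 6: {4}}
print(classify(adj, [1, 2, 3, 4]))     # path
print(classify(adj, [1, 2, 3, 4, 2]))  # trail
print(classify(adj, [2, 3, 4, 2]))     # cycle
print(classify(adj, [5, 4, 6, 4, 5]))  # closed walk
```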

One can perform a random walk on a weighted graph, where nodes are
visited randomly. The weight of an edge, in this case, defines the probability
of traversing it. For this to work correctly, we must make sure that for all
edges that start at $v_i$ we have

$\sum_x w_{i,x} = 1, \quad \forall i, j \;\; w_{i,j} \ge 0$. (2.15)

Figure 2.7. Walk, Path, Trail, Tour, and Cycle. In this figure, $v_4, v_3, v_6, v_4, v_2$ is a walk;
$v_4, v_3$ is a path; $v_4, v_3, v_6, v_4, v_2$ is a trail; and $v_4, v_3, v_6, v_4$ is both a tour and a cycle.

The random walk procedure is outlined in Algorithm 2.1. The algorithm
starts at a node $v_0$ and visits its adjacent nodes based on the transition
probability (weight) assigned to edges connecting them. This procedure
is performed for $t$ steps (provided to the algorithm); therefore, a walk of
length $t$ is generated by the random walk.

Figure 2.8. Hamiltonian Cycle and Eulerian Tour. In a Hamiltonian cycle we start at one
node, visit all other nodes only once, and return to our start node. In an Eulerian tour, we
traverse all edges only once and return to our start point. In an Eulerian tour, we can visit
a single node multiple times. In this figure, $v_1, v_5, v_3, v_1, v_2, v_4, v_6, v_2, v_3, v_4, v_5, v_6, v_1$
is an Eulerian tour.

Algorithm 2.1 Random Walk

Require: Initial node $v_0$, weighted graph $G(V, E, W)$, steps $t$
return Random walk $P$
  state = 0;
  $v_t = v_0$;
  $P = \{v_0\}$;
  while state < $t$ do
    state = state + 1;
    select a random node $v_j$ adjacent to $v_t$ with probability $w_{t,j}$;
    $v_t = v_j$;
    $P = P \cup \{v_j\}$;
  end while
  Return $P$
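A possible Python rendering of Algorithm 2.1 is sketched below. It assumes
the graph is stored as a nested dictionary `weights[u][v]` holding the
transition probability $w_{u,v}$, with each node's outgoing weights summing
to 1 as required by Equation 2.15. The function name `random_walk`, the
`rng` parameter, and the three-node graph are hypothetical. The walk is
kept as a list rather than a set so that the visiting order and repeated
nodes are preserved.

```python
import random


def random_walk(weights, start, steps, rng=random):
    """Generate a walk of `steps` edges starting at `start`."""
    walk = [start]
    current = start
    for _ in range(steps):
        neighbors = list(weights[current])
        probs = [weights[current][v] for v in neighbors]
        # Pick the next node with probability proportional to the edge weight.
        current = rng.choices(neighbors, weights=probs, k=1)[0]
        walk.append(current)
    return walk


# Hypothetical weighted directed graph; each row of weights sums to 1.
weights = {
    "a": {"b": 0.5, "c": 0.5},
    "b": {"a": 1.0},
    "c": {"a": 0.3, "b": 0.7},
}
walk = random_walk(weights, "a", 10, rng=random.Random(0))
print(len(walk))  # 11 nodes, i.e., a walk of length 10
```

Passing a seeded `random.Random` instance makes a run reproducible, which
is convenient when debugging.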


Connectivity. A node $v_i$ is connected to node $v_j$ (or $v_j$ is reachable
from $v_i$) if it is adjacent to it or there exists a path from $v_i$ to
$v_j$. A graph is connected if there exists a path between any pair of nodes
in it. In a directed graph, the graph is weakly connected if there exists a
path between any pair of nodes, without following the edge directions (i.e.,
directed edges are replaced with undirected edges). The graph is strongly
connected if there exists a directed path (following edge directions)
between any pair of nodes. Figure 2.9 shows examples of connected,
disconnected, weakly connected, and strongly connected graphs.

Components. A component in an undirected graph is a subgraph such
that there exists a path between every pair of nodes inside the component.
Figure 2.10(a) depicts an undirected graph with three components.
A component in a directed graph is strongly connected if, for every pair
of nodes $v$ and $u$, there exists a directed path from $v$ to $u$ and one
from $u$ to $v$. The component is weakly connected if replacing directed
edges with undirected edges results in a connected component. A graph with
three strongly connected components is shown in Figure 2.10(b).

Figure 2.9. Connectivity in Graphs.

Figure 2.10. Components in Undirected and Directed Graphs. (a) A Graph with Three
Components; (b) A Graph with Three Strongly Connected Components.
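For undirected graphs, components can be found by repeatedly exploring
from a node that has not been reached yet. The following is a minimal
sketch that uses a queue-based exploration; the function name
`connected_components` and the six-node graph are hypothetical (the graph
is not the one in Figure 2.10(a)), and each component is reported as its
node set.

```python
from collections import deque


def connected_components(adj):
    """Return the components of an undirected graph as sorted node lists."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(component))
    return components


# Hypothetical undirected graph with three components.
adj = {1: [2], 2: [1], 3: [4, 5], 4: [3], 5: [3], 6: []}
components = connected_components(adj)
print(components)  # [[1, 2], [3, 4, 5], [6]]
```

A graph is connected exactly when this procedure returns a single
component. Strongly connected components of directed graphs need a
different procedure and are not covered by this sketch.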

Shortest Path. When a graph is connected, multiple paths can exist between
any pair of nodes. Often, we are interested in the path that has the shortest
length. This path is called the shortest path. Applications for shortest paths
include GPS routing, where users are interested in the shortest path to
their destination. In this chapter, we denote the length of the shortest path
between nodes $v_i$ and $v_j$ as $l_{i,j}$. The concept of the neighborhood
of a node $v_i$ can be generalized using shortest paths. An n-hop
neighborhood of node $v_i$ is the set of nodes that are within $n$ hops
distance from node $v_i$. That is, their shortest path to $v_i$ has length
less than or equal to $n$.

Diameter. The diameter of a graph is defined as the length of the longest
shortest path between any pair of nodes in the graph. It is defined only for
connected graphs because the shortest path might not exist in disconnected
graphs. Formally, for a graph $G$ the diameter is defined as

$\text{diameter}_G = \max_{(v_i, v_j) \in V \times V} l_{i,j}$. (2.16)
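For unweighted graphs, where the length of a path is its number of edges,
both the n-hop neighborhood and the diameter can be computed by exploring
the graph level by level from each node. The sketch below makes several
assumptions: the graph is undirected and unweighted, the function names are
hypothetical, and the n-hop neighborhood is taken to exclude the node
itself. The example graph is the one from Figure 2.4, with integers 1 to 6
standing for $v_1, \ldots, v_6$.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Lengths l of the shortest paths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def n_hop_neighborhood(adj, source, n):
    """Nodes whose shortest path to source has length at most n."""
    dist = shortest_path_lengths(adj, source)
    return {v for v, d in dist.items() if 0 < d <= n}


def diameter(adj):
    """Longest shortest path over all node pairs of a connected graph."""
    best = 0
    for s in adj:
        dist = shortest_path_lengths(adj, s)
        if len(dist) < len(adj):
            raise ValueError("diameter is undefined for a disconnected graph")
        best = max(best, max(dist.values()))
    return best


adj = {1: [2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3, 5, 6], 5: [4], 6: [4]}
print(n_hop_neighborhood(adj, 1, 2))  # {2, 3, 4}
print(diameter(adj))                  # 3, e.g., between nodes 1 and 5
```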


2.5 Special Graphs 

Using general concepts defined thus far, many special graphs can be defined. 
These special graphs can be used to model different problems. We review 
some well-known special graphs and their properties in this section. 





















Figure 2.11. A Forest Containing Three Trees. 

2.5.1 Trees and Forests 

Trees are special cases of undirected graphs. A tree is a connected graph
structure that has no cycle in it. In a tree, there is exactly one path
between any pair of nodes. A graph consisting of a set of disconnected trees
is called a forest. A forest is shown in Figure 2.11.

In a tree where we have $|V|$ nodes, we have $|E| = |V| - 1$ edges. This
can be proved by contradiction (see Exercises).


2.5.2 Special Subgraphs 

Some subgraphs are frequently used because of their properties. Two such 
subgraphs are discussed here. 

1. Spanning Tree. For any connected graph, a spanning tree is a
   subgraph and a tree that includes all the nodes of the graph. When
   the original graph is not a tree, its spanning tree includes all the
   nodes, but not all the edges. There may exist multiple spanning
   trees for a graph. For a weighted graph and one of its spanning trees,
   the weight of that spanning tree is the summation of the edge weights
   in the tree. Among the many spanning trees found for a weighted
   graph, the one with the minimum weight is called the minimum
   spanning tree (MST).

   For example, consider a set of cities, where roads need to be built to
   connect them. We know the distance between each pair of cities. We
   can represent each city with a node and the distance between these
   nodes using an edge between them labeled with the distance. This
   graph-based view is shown in Figure 2.12. In this graph, nodes $v_1$,
   $v_2, \ldots, v_9$ represent cities, and the values attached to edges
   represent the distance between them. Note that edges only represent
   distances (potential roads!), and roads may not exist between these
   cities. Due to construction costs, the government needs to minimize
   the total mileage of roads built and, at the same time, needs to
   guarantee that there is a path (i.e., a set of roads) that connects
   every two cities. The minimum spanning tree is a solution to this
   problem. The edges in the MST represent roads that need to be built to
   connect all of the cities at the minimum length possible. Figure 2.12
   highlights the minimum spanning tree.

Figure 2.12. Minimum Spanning Tree. Nodes represent cities and values assigned to
edges represent geographical distance between cities. Highlighted edges are roads that
are built in a way that minimizes their total length.

2. Steiner Tree. The Steiner tree problem is similar to the minimum
   spanning tree problem. Given a weighted graph $G(V, E, W)$ and a
   subset of nodes $V' \subseteq V$ (terminal nodes), the Steiner tree
   problem aims to find a tree such that it spans all the $V'$ nodes and
   the weight of the tree is minimized. Note that the problem is different
   from the MST problem because we do not need to span all nodes of the
   graph $V$, but only a subset of the nodes $V'$. A Steiner tree example
   is shown in Figure 2.13. In this example, $V' = \{v_2, v_4, v_7\}$.

Figure 2.13. Steiner Tree. [Steiner tree for $V' = \{v_2, v_4, v_7\}$].


2.5.3 Complete Graphs

A complete graph is a graph where for a set of nodes $V$, all possible edges
exist in the graph. In other words, all pairs of nodes are connected with
an edge. Hence,

$|E| = \binom{|V|}{2}$. (2.17)

Complete graphs with $n$ nodes are often denoted as $K_n$. $K_1$, $K_2$,
$K_3$, and $K_4$ are shown in Figure 2.14.


2.5.4 Planar Graphs 

A graph that can be drawn in such a way that no two edges cross each other
(other than the endpoints) is called planar. A graph that is not planar is
denoted as nonplanar. Figure 2.15 shows an example of a planar graph and
a nonplanar graph.

Figure 2.15. Planar and Nonplanar Graphs. (a) Planar Graph; (b) Nonplanar Graph.


2.5.5 Bipartite Graphs

A bipartite graph $G(V, E)$ is a graph where the node set can be partitioned
into two sets such that, for all edges, one endpoint is in one set and the other
endpoint is in the other set. In other words, edges connect nodes in these
two sets, but there exist no edges between nodes that belong to the same
set. Formally,

$V = V_L \cup V_R$, (2.18)

$V_L \cap V_R = \emptyset$, (2.19)

$E \subseteq V_L \times V_R$. (2.20)

Figure 2.16(a) shows a sample bipartite graph. In this figure,
$V_L = \{v_1, v_2\}$ and $V_R = \{v_3, v_4, v_5\}$.

In social media, affiliation networks are well-known examples of bipartite
graphs. In these networks, nodes in one part ($V_L$ or $V_R$) represent
individuals, and nodes in the other part represent affiliations. If an
individual is associated with an affiliation, an edge connects the
corresponding nodes. A sample affiliation network is shown in
Figure 2.16(b).

Figure 2.16. Bipartite Graphs and Affiliation Networks. (a) Bipartite Graph;
(b) Affiliation Network.

Figure 2.17. Regular Graph with $k = 3$.
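Whether a graph is bipartite, as defined in Equations 2.18 to 2.20, can be
tested by trying to assign every node to $V_L$ or $V_R$ so that no edge
stays within one side. The sketch below is one way to do this, with
hypothetical names throughout: the function `bipartition`, and a small
made-up affiliation network of two people and two groups.

```python
from collections import deque


def bipartition(adj):
    """Return (V_L, V_R) if the undirected graph is bipartite, else None."""
    side = {}
    for start in adj:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in side:
                    side[v] = 1 - side[u]  # place v opposite to u
                    queue.append(v)
                elif side[v] == side[u]:
                    return None  # an edge inside one side: not bipartite
    left = {v for v, s in side.items() if s == 0}
    right = {v for v, s in side.items() if s == 1}
    return left, right


# Hypothetical affiliation network: individuals on one side, groups on the other.
affiliations = {
    "Ana": ["chess", "hiking"],
    "Ben": ["hiking"],
    "chess": ["Ana"],
    "hiking": ["Ana", "Ben"],
}
parts = bipartition(affiliations)  # ({'Ana', 'Ben'}, {'chess', 'hiking'})

triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(bipartition(triangle))  # None
```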


2.5.6 Regular Graphs 

A regular graph is one in which all nodes have the same degree. A regular
graph where all nodes have degree 2 is called a 2-regular graph. More
generally, a graph where all nodes have degree $k$ is called a $k$-regular
graph. Regular graphs can be connected or disconnected. Complete graphs are
examples of regular graphs, where all $n$ nodes have degree $n - 1$ (i.e.,
$k = n - 1$). Cycles are also regular graphs, where $k = 2$. Another example
for $k = 3$ is shown in Figure 2.17.


2.5.7 Bridges 

Consider a graph with several connected components. Edges in this graph
whose removal will increase the number of connected components are
called bridges. As the name suggests, these edges act as bridges between
connected components. The removal of these edges results in the
disconnection of formerly connected components and hence an increase in the
number of components. An example graph and all its bridges are depicted
in Figure 2.18.
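The definition translates directly into a brute-force test: remove each
edge in turn and check whether the number of components grows. The sketch
below follows that idea literally; it is not the most efficient approach,
the function names are hypothetical, and the example is the graph of
Figure 2.4 (integers 1 to 6 for $v_1, \ldots, v_6$) rather than the graph
in Figure 2.18.

```python
def count_components(nodes, edges):
    """Number of connected components of an undirected graph."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return count


def bridges(nodes, edges):
    """Edges whose removal increases the number of components."""
    base = count_components(nodes, edges)
    return [e for i, e in enumerate(edges)
            if count_components(nodes, edges[:i] + edges[i + 1:]) > base]


nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6)]
found = bridges(nodes, edges)
print(found)  # [(1, 2), (4, 5), (4, 6)]
```

The three edges on the triangle formed by nodes 2, 3, and 4 are not
bridges, because removing any one of them leaves an alternative route.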


2.6 Graph Algorithms 

In this section, we review some well-known algorithms for graphs, although
they are only a small fraction of the plethora of algorithms related to graphs.

Figure 2.18. Bridges. Highlighted edges represent bridges; their removal will increase
the number of components in the graph.

2.6.1 Graph/Tree Traversal 

Among the most useful algorithms for graphs are the traversal algorithms
for graphs and special subgraphs, such as trees. Consider a social media site
that has many users, and suppose we are interested in surveying it and
computing the average age of its users. The usual technique is to start from
one user and employ some traversal technique to browse his friends and then
these friends' friends and so on. The traversal technique guarantees that
(1) all users reachable from the starting user are visited and (2) no user
is visited more than once.

In this section, we discuss two traversal algorithms: depth-first search
(DFS) and breadth-first search (BFS).


Depth-First Search (DFS) 

Depth-first search (DFS) starts from a node $v_i$, selects one of its
neighbors $v_j \in N(v_i)$, and performs DFS on $v_j$ before visiting other
neighbors in $N(v_i)$. In other words, DFS explores as deep as possible in
the graph using one neighbor before backtracking to other neighbors.
Consider a node $v_i$ that has neighbors $v_j$ and $v_k$; that is,
$v_j, v_k \in N(v_i)$. Let $v_{j(1)} \in N(v_j)$ and $v_{j(2)} \in N(v_j)$
denote neighbors of $v_j$ such that $v_i \neq v_{j(1)} \neq v_{j(2)}$. Then
for a depth-first search starting at $v_i$ that visits $v_j$ next, the next
set of visited nodes are $v_{j(1)}$ and $v_{j(2)}$. In other words, a deeper
node $v_{j(1)}$ is preferred to a neighbor $v_k$ that is closer to $v_i$.
Depth-first search can be used both for trees and graphs, but is better
visualized using trees. The DFS execution on a tree is shown in
Figure 2.19(a).

Figure 2.19. Graph Traversal Example. (a) Depth-First Search (DFS);
(b) Breadth-First Search (BFS).

Algorithm 2.2 Depth-First Search (DFS)

Require: Initial node $v$, graph/tree $G(V, E)$, stack $S$
return An ordering on how nodes in $G$ are visited
  Push $v$ into $S$;
  visitOrder = 0;
  while $S$ not empty do
    node = pop from $S$;
    if node not visited then
      visitOrder = visitOrder + 1;
      Mark node as visited with order visitOrder; // or print node
      Push all neighbors/children of node into $S$;
    end if
  end while
  Return all nodes with their visit order.

The DFS algorithm is provided in Algorithm 2.2. The algorithm uses a
stack structure to visit nonvisited nodes in a depth-first fashion.
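A direct Python transcription of Algorithm 2.2 might look as follows. It is
a sketch under stated assumptions: the graph or tree is a dictionary
mapping each node to a list of its neighbors (or children), the function
name `dfs_order` is hypothetical, and the six-node tree is made up for
illustration rather than taken from Figure 2.19(a).

```python
def dfs_order(adj, start):
    """Return {node: visit order} for a stack-based depth-first search."""
    visit_order = {}
    stack = [start]
    while stack:
        node = stack.pop()
        if node not in visit_order:
            visit_order[node] = len(visit_order) + 1
            # Push in reverse so that the first-listed neighbor is popped first.
            stack.extend(reversed(adj[node]))
    return visit_order


# Hypothetical tree: a has children b and c; b has d and e; c has f.
tree = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f"],
        "d": [], "e": [], "f": []}
order = dfs_order(tree, "a")
print(list(order))  # ['a', 'b', 'd', 'e', 'c', 'f']
```

The subtree under b is exhausted before c is visited, which is the
"as deep as possible" behavior described above.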

Breadth-First Search (BFS) 

Breadth-first search (BFS) starts from a node, visits all its immediate
neighbors first, and then moves to the second level by traversing their
neighbors. Like DFS, the algorithm can be used both for trees and graphs and
is provided in Algorithm 2.3.

Algorithm 2.3 Breadth-First Search (BFS)

Require: Initial node $v$, graph/tree $G(V, E)$, queue $Q$
return An ordering on how nodes are visited
  Enqueue $v$ into queue $Q$;
  visitOrder = 0;
  while $Q$ not empty do
    node = dequeue from $Q$;
    if node not visited then
      visitOrder = visitOrder + 1;
      Mark node as visited with order visitOrder; // or print node
      Enqueue all neighbors/children of node into $Q$;
    end if
  end while

The algorithm uses a queue data structure to achieve its goal of breadth
traversal. Its execution on a tree is shown in Figure 2.19(b).
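Algorithm 2.3 differs from depth-first search only in the data structure
that holds pending nodes. The sketch below mirrors it with a queue; as
before, the dictionary-of-lists graph format, the function name
`bfs_order`, and the six-node tree are assumptions made for illustration
(the tree is not the one in Figure 2.19(b)).

```python
from collections import deque


def bfs_order(adj, start):
    """Return {node: visit order} for a queue-based breadth-first search."""
    visit_order = {}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node not in visit_order:
            visit_order[node] = len(visit_order) + 1
            queue.extend(adj[node])  # neighbors wait behind earlier nodes
    return visit_order


# Hypothetical tree: a has children b and c; b has d and e; c has f.
tree = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f"],
        "d": [], "e": [], "f": []}
order = bfs_order(tree, "a")
print(list(order))  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Here both children of a are visited before any grandchild, so nodes are
reached level by level.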

In social media, we can use BFS or DFS to traverse a social network: 
the algorithm choice depends on which nodes we are interested in visiting 
first. In social media, immediate neighbors (i.e., friends) are often more 
important to visit first; therefore, it is more common to use breadth-first 
search. 


2.6.2 Shortest Path Algorithms 

In many scenarios, we require algorithms that can compute the shortest path 
between two nodes in a graph. For instance, in the case of navigation, we 
have a weighted network of cities connected via roads, and we are interested 
in computing the shortest path from a source city to a destination city. In 
social media mining, we might be interested in determining how tightly 
connected a social network is by measuring its diameter. The diameter can 
be calculated by finding the longest shortest path between any two nodes in 
the graph. 


Dijkstra s Algorithm 

A well-established shortest path algorithm was developed in 1959 by Eds- 
gerd Dijkstra. The algorithm is designed for weighted graphs with non¬ 
negative edges. The algorithm finds the shortest paths that start from a 
starting node 5 to all other nodes and the lengths of those paths. 

The Dijkstra’s algorithm is provided in Algorithm 2.4. As mentioned, 
the goal is to find the shortest paths and their lengths from a source node 
s to all other nodes in the graph. The distance array (Line 3) keeps track 
of the shortest path distance from s to other nodes. The algorithm starts by 
assuming that there is a shortest path of infinite length to any node, except 
s, and will update these distances as soon as a better distance (shorter path) 
is observed. The steps of the algorithm are as follows: 

1. All nodes are initially unvisited. From the unvisited set of nodes, the 
one that has the minimum shortest path length is selected. We denote 
this node as smallest in the algorithm. 

2. For this node, we check all its neighbors that are still unvisited. 
For each unvisited neighbor, we check if its current distance can be 
improved by considering the shortest path that goes through smallest. 
Algorithm 2.4 Dijkstra’s Shortest Path Algorithm

Require: Start node s, weighted graph/tree G(V, E, W)
 1: return Shortest paths and distances from s to all other nodes.
 2: for v ∈ V do
 3:     distance[v] = ∞;
 4:     predecessor[v] = −1;
 5: end for
 6: distance[s] = 0;
 7: unvisited = V;
 8: while unvisited ≠ ∅ do
 9:     smallest = arg min_{v ∈ unvisited} distance[v];
10:     if distance[smallest] == ∞ then
11:         break;
12:     end if
13:     unvisited = unvisited \ {smallest};
14:     currentDistance = distance[smallest];
15:     for each neighbor adjacent to smallest, neighbor ∈ unvisited do
16:         newPath = currentDistance + w(smallest, neighbor);
17:         if newPath < distance[neighbor] then
18:             distance[neighbor] = newPath;
19:             predecessor[neighbor] = smallest;
20:         end if
21:     end for
22: end while
23: Return distance[] and predecessor[] arrays


This can be performed by comparing its current shortest path length
(distance[neighbor]) to the path length that goes through smallest
(distance[smallest] + w(smallest, neighbor)). This condition is
checked in Line 17.

3. If the current shortest path can be improved, the path and its length 
are updated. The paths are saved based on predecessors in the path 
sequence. Since, for every node, we only need the predecessor to 
reconstruct a path recursively, the predecessor array keeps track of 
this. 

4. A node is marked as visited after all its neighbors are processed and 
is no longer changed in terms of (1) the shortest path that ends with 
it and (2) its shortest path length. 
Figure 2.20. Dijkstra’s Algorithm Execution Example. The shortest path between nodes
s and t is calculated. The values inside nodes at each step show the best shortest path
distance computed up to that step. An arrow denotes the node being analyzed.

To further clarify the process, an example of Dijkstra’s algorithm is
provided.

Example 2.3. Figure 2.20 provides an example of Dijkstra’s shortest
path algorithm. We are interested in finding the shortest path between s and
t. The shortest path is highlighted using dashed lines. In practice, shortest
paths are saved using the predecessor array.
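The steps of Algorithm 2.4 can be sketched in Python as follows. This is a minimal sketch rather than a definitive implementation: the dictionary-of-dictionaries graph format and the helper name reconstruct_path are assumptions made for illustration, a priority queue replaces the linear arg min of Line 9 (the result is the same), and None stands in for the −1 predecessor marker.

```python
import heapq
import math


def dijkstra(graph, source):
    """Shortest distances and predecessors from source.

    graph: dict mapping node -> dict of neighbor -> nonnegative edge weight
    (an assumed input format).
    """
    distance = {v: math.inf for v in graph}
    predecessor = {v: None for v in graph}   # None plays the role of -1
    distance[source] = 0
    visited = set()
    heap = [(0, source)]                     # (distance, node) pairs
    while heap:
        current_distance, smallest = heapq.heappop(heap)
        if smallest in visited:
            continue                         # stale queue entry, already finalized
        visited.add(smallest)
        for neighbor, weight in graph[smallest].items():
            if neighbor in visited:
                continue
            new_path = current_distance + weight
            if new_path < distance[neighbor]:        # the test in Line 17
                distance[neighbor] = new_path
                predecessor[neighbor] = smallest
                heapq.heappush(heap, (new_path, neighbor))
    return distance, predecessor


def reconstruct_path(predecessor, source, target):
    """Walk the predecessor entries back from target to source."""
    path = [target]
    while path[-1] != source:
        previous = predecessor[path[-1]]
        if previous is None:
            return []                        # target is unreachable from source
        path.append(previous)
    return path[::-1]
```

Nodes that cannot be reached from the source keep an infinite distance, which mirrors the early exit in Lines 10 to 12.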

2.6.3 Minimum Spanning Trees 

A spanning tree of a connected undirected graph is a tree that includes all 
the nodes in the original graph and a subset of its edges. Spanning trees play 
important roles in many real-life applications. A cable company that wants 
to lay wires wishes not only to cover all areas (nodes) but also minimize the 
cost of wiring (summation of edges). In social media mining, consider a 
network of individuals who need to be provided with a piece of information. 
The information spreads via friends, and there is a cost associated with 








Algorithm 2.5 Prim’s Algorithm

Require: Connected weighted graph G(V, E, W)
1: return Spanning tree T(V_s, E_s)
2: V_s = {a random node from V};
3: E_s = {};
4: while V ≠ V_s do
5:     e(u, v) = arg min_{(u,v), u ∈ V_s, v ∈ V − V_s} w(u, v);
6:     V_s = V_s ∪ {v};
7:     E_s = E_s ∪ e(u, v);
8: end while
9: Return tree T(V_s, E_s) as the minimum spanning tree;


spreading information among every two nodes. The minimum spanning 
tree of this network will provide the minimum cost required to inform all 
individuals in this network. 

There exist a variety of algorithms for finding minimum spanning trees. A 
famous algorithm for finding MSTs in a weighted graph is Prim’s algorithm 
[Prim, 1957]. Interested readers can refer to the bibliographic notes for 
further references. 


Prim’s Algorithm

Prim’s algorithm is provided in Algorithm 2.5. It starts by selecting a random 
node and adding it to the spanning tree. It then grows the spanning tree by 
selecting edges that have one endpoint in the existing spanning tree and one 
endpoint among the nodes that are not selected yet. Among the possible 
edges, the one with the minimum weight is added to the set (along with 
its endpoint). This process is iterated until the graph is fully spanned. An 
example of Prim’s algorithm is provided in Figure 2.21. 
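The growth process just described can be sketched in Python. This is an illustrative sketch under stated assumptions: the graph is given as a dictionary of neighbor-to-weight dictionaries, it is undirected and connected, and the start node is passed in explicitly instead of being chosen at random. A heap holds the candidate edges that leave the current tree.

```python
import heapq


def prim(graph, start):
    """Return the edges (u, v, weight) of a minimum spanning tree.

    graph: dict mapping node -> dict of neighbor -> weight, assumed to be
    undirected (each edge listed in both directions) and connected.
    """
    in_tree = {start}
    tree_edges = []
    # Candidate edges with one endpoint inside the tree.
    candidates = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(candidates)
    while candidates and len(in_tree) < len(graph):
        weight, u, v = heapq.heappop(candidates)
        if v in in_tree:
            continue                  # both endpoints already spanned
        in_tree.add(v)                # V_s = V_s ∪ {v}
        tree_edges.append((u, v, weight))
        for nxt, w in graph[v].items():
            if nxt not in in_tree:
                heapq.heappush(candidates, (w, v, nxt))
    return tree_edges
```

The total cost of the tree is the sum of the returned weights; different start nodes may return different trees of the same total weight when edge weights are tied.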


2.6.4 Network Flow Algorithms 

Consider a network of pipes that connect an infinite water source to a water 
sink. In these networks, given the capacity of these pipes, an interesting 
question is, What is the maximum flow that can be sent from the source to 
the sink? 

Network flow algorithms aim to answer this question. This type of question
arises in many different fields. At first glance, these problems do not
seem to be related to network flow algorithms, but there are strong parallels.


Figure 2.22. A Sample Flow Network. 


For instance, in social media sites where users have daily limits (the capacity,
here) of sending messages (the flow) to others, what is the maximum
number of messages the network should be prepared to handle at any time?
Before we delve into the details, let us formally define a flow network.


Flow Network 

A flow network $G(V, E, C)$ is a directed weighted graph, where we have
the following:

• $\forall\, e(u, v) \in E$, $c(u, v) \ge 0$ defines the edge capacity.

• When $(u, v) \in E$, $(v, u) \notin E$ (opposite flow is impossible).

• s defines the source node and t defines the sink node. An infinite
supply of flow is connected to the source.

A sample flow network, along with its capacities, is shown in Figure 2.22. 


Flow 

Given edges with certain capacities, we can fill these edges with the flow up 
to their capacities. This is known as the capacity constraint. Furthermore, 
we should guarantee that the flow that enters any node other than source 
s and sink t is equal to the flow that exits it so that no flow is lost (flow 
conservation constraint). Formally, 

• $\forall (u, v) \in E$, $f(u, v) \ge 0$ defines the flow passing through that edge.

• $\forall (u, v) \in E$, $0 \le f(u, v) \le c(u, v)$ (capacity constraint).

• $\forall v \in V$, $v \notin \{s, t\}$, $\sum_{k:(k,v) \in E} f(k, v) = \sum_{l:(v,l) \in E} f(v, l)$ (flow
conservation constraint).

Commonly, to visualize an edge with capacity c and flow f, we use the
notation f/c. A sample flow network with its flows and capacities is shown
in Figure 2.23.
Figure 2.23. A Sample Flow Network with Flows and Capacities.

Flow Quantity 

The flow quantity (or value of the flow) in any network is the amount 
of outgoing flow from the source minus the incoming flow to the source. 
Alternatively, one can compute this value by subtracting the outgoing flow 
from the sink from its incoming value: 

$$\text{flow} = \sum_{v} f(s, v) - \sum_{v} f(v, s) = \sum_{v} f(v, t) - \sum_{v} f(t, v). \tag{2.21}$$


Example 2.4. The flow quantity for the example in Figure 2.23 is 19:

$$\text{flow} = \sum_{v} f(s, v) - \sum_{v} f(v, s) = (11 + 8) - 0 = 19. \tag{2.22}$$
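The capacity constraint, the flow conservation constraint, and the flow quantity can all be checked mechanically. The sketch below assumes a hypothetical representation in which capacities and flows are dictionaries keyed by (u, v) edge tuples; the function names are illustrative only.

```python
def is_valid_flow(capacity, flow, source, sink):
    """Check the capacity and flow conservation constraints.

    capacity, flow: dicts mapping (u, v) edge tuples to numbers
    (an assumed representation); missing flow entries count as 0.
    """
    # Capacity constraint: 0 <= f(u, v) <= c(u, v) on every edge.
    for edge, c in capacity.items():
        if not 0 <= flow.get(edge, 0) <= c:
            return False
    # Conservation: inflow equals outflow at every node except s and t.
    nodes = {node for edge in capacity for node in edge}
    for v in nodes - {source, sink}:
        inflow = sum(f for (_, b), f in flow.items() if b == v)
        outflow = sum(f for (a, _), f in flow.items() if a == v)
        if inflow != outflow:
            return False
    return True


def flow_value(flow, source):
    """Outgoing flow from the source minus incoming flow to the source."""
    outgoing = sum(f for (u, _), f in flow.items() if u == source)
    incoming = sum(f for (_, v), f in flow.items() if v == source)
    return outgoing - incoming
```

For a valid flow, evaluating the same difference at the sink (incoming minus outgoing) gives the same number, as stated in Equation 2.21.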

Our goal is to find the flow assignments to each edge with the maximum 
flow quantity. This can be achieved by a maximum flow algorithm. A 
well-established one is the Ford-Fulkerson algorithm [Ford and Fulkerson,
1956].


Ford-Fulkerson Algorithm 

The intuition behind this algorithm is as follows: Find a path from source 
to sink such that there is unused capacity for all edges in the path. Use that 
capacity (the minimum capacity unused among all edges on the path) to 
increase the flow. Iterate until no other path is available. 

Before we formalize this, let us define some concepts. 

Given a flow in network $G(V, E, C)$, we define another network
$G_R(V, E_R, C_R)$, called the residual network. This network defines how
much capacity remains in the original network. The residual network has
an edge between nodes u and v if and only if either (u, v) or (v, u) exists
in the original graph. If one of these two exists in the original network, we
would have two edges in the residual network: one from u to v and one from
v to u. The intuition is that when there is no flow going through an edge in
the original network, a flow of as much as the capacity of the edge remains
in the residual. However, in the residual network, one has the ability to send
flow in the opposite direction to cancel some amount of flow in the original
network.

(b) Residual Network

Figure 2.24. A Flow Network and Its Residual.

The residual capacity $c_R(u, v)$ for any edge (u, v) in the residual graph
is

$$c_R(u, v) = \begin{cases} c(u, v) - f(u, v) & \text{if } (u, v) \in E \\ f(v, u) & \text{if } (u, v) \notin E \end{cases} \tag{2.23}$$


A flow network example and its resulting residual network are shown in
Figure 2.24. In the residual network, edges that have zero residual capacity
are not shown.
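Equation 2.23 translates directly into code. The sketch below assumes the same hypothetical representation of capacities and flows as dictionaries keyed by (u, v) tuples, and it relies on the flow network definition that rules out a pair of opposite edges. As in the figure, edges with zero residual capacity are omitted.

```python
def residual_capacities(capacity, flow):
    """Build the residual capacities c_R from capacities and a flow.

    capacity, flow: dicts keyed by (u, v) tuples (assumed format).
    Assumes that (u, v) and (v, u) never both appear in capacity.
    """
    residual = {}
    for (u, v), c in capacity.items():
        f = flow.get((u, v), 0)
        if c - f > 0:
            residual[(u, v)] = c - f   # unused capacity, same direction
        if f > 0:
            residual[(v, u)] = f       # flow that can be pushed back
    return residual
```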


Augmentation and Augmenting Paths 

In the residual graph, when edges are in the same direction as the original 
graph, their capacity shows how much more flow can be pushed along that 
edge in the original graph. When edges are in the opposite direction, their 
capacities show how much flow can be pushed back on the original graph 
edge. So, by finding a flow in the residual, we can augment the flow in 
the original graph. Any simple path from s to t in the residual graph is an 
augmenting path. Since all capacities in the residual are positive, these paths 
can augment flows in the original, thus increasing the flow. The amount of 
flow that can be pushed along this path is equal to the minimum capacity 
along the path, since the edge with that capacity limits the amount of flow 
being pushed. Given flow $f(u, v)$ in the original graph and flows $f_R(u, v)$
and $f_R(v, u)$ in the residual graph, we can augment the flow as follows:

$$f_{\text{augmented}}(u, v) = f(u, v) + f_R(u, v) - f_R(v, u). \tag{2.24}$$

Example 2.5. Consider the graph in Figure 2.24(b) and the augmenting
path $s, v_1, v_3, v_4, v_2, t$. It has a minimum capacity of 1 along the path, so
the flow quantity will be 1. We can augment the original graph with this
path. The new flow graph and its corresponding residual graph are shown
in Figure 2.25. In the new residual, no more augmenting paths can be found.

The Ford-Fulkerson algorithm will find the maximum flow in a network, 
but we skip the proof of optimality. Interested readers can refer to the 
bibliographic notes for proof of optimality and further information. The 
algorithm is provided in Algorithm 2.6. 


Algorithm 2.6 Ford-Fulkerson Algorithm

Require: Connected weighted graph G(V, E, W), source s, sink t
1: return A maximum flow graph
2: ∀(u, v) ∈ E, f(u, v) = 0
3: while there exists an augmenting path p in the residual graph G_R do
4:     Augment flows by p
5: end while
6: Return flow value and flow graph;



Figure 2.26. Maximum Bipartite Matching. 


The algorithm searches for augmenting paths, if possible, in the residual
and augments flows in the original flow network. Path finding can be
achieved by any graph traversal algorithm, such as BFS.
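Algorithm 2.6 with BFS as the path finder can be sketched as follows. Choosing BFS gives the variant commonly known as Edmonds-Karp. The input format (a dictionary of capacities keyed by (u, v) tuples) and the function name are assumptions for illustration, and the sketch relies on the flow network definition that excludes opposite edge pairs.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Ford-Fulkerson with BFS path search.

    capacity: dict mapping (u, v) -> capacity (assumed format; no pair of
    opposite edges). Returns (flow value, dict of edge flows).
    """
    residual = dict(capacity)
    adjacency = {}
    for u, v in capacity:
        residual.setdefault((v, u), 0)           # backward residual edge
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    value = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in adjacency.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break                                # no augmenting path remains
        # The minimum residual capacity along the path limits the push.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        # Augment: reduce forward residuals, increase backward residuals.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        value += bottleneck
    flow = {edge: c - residual[edge] for edge, c in capacity.items()}
    return value, flow
```

The returned flow dictionary satisfies the capacity and conservation constraints, and the returned value equals the net flow leaving the source.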


2.6.5 Maximum Bipartite Matching 
Suppose we are trying to solve the following problem in social media: 

Given n products and m users such that some users are only interested in certain 
products, find the maximum number of products that can be bought by users. 

The problem is graphically depicted in Figure 2.26. The nodes on the 
left represent products and the nodes on the right represent users. Edges 
represent the interest of users in products. Highlighted edges demonstrate 
a matching, where products are matched with users. The figure on the left 
depicts a matching and the figure on the right depicts a maximum matching, 
where no more edges can be added to increase the size of the matching. 

This problem can be reformulated as a bipartite graph problem. Given
a bipartite graph, where $V_L$ and $V_R$ represent the left and right node sets
($V = V_L \cup V_R$), and E represents the edges, we define a matching $M$,
$M \subset E$, such that each node in V appears in at most one edge in M.
In other words, a node is either matched (it appears in an edge $e \in M$)
or not. A maximum bipartite matching $M_{\max}$ is a matching
such that, for any other matching $M'$ in the graph, $|M_{\max}| \ge |M'|$.

Here we solve the maximum bipartite matching problem using the previously
discussed Ford-Fulkerson maximum flow technique. The problem can
be easily solved by creating a flow graph $G(V', E', C)$ from our bipartite
graph $G(V, E)$, as follows:

• Set $V' = V \cup \{s\} \cup \{t\}$.

• Connect all nodes in $V_L$ to s and all nodes in $V_R$ to t,

$$E' = E \cup \{(s, v) \mid v \in V_L\} \cup \{(v, t) \mid v \in V_R\}. \tag{2.25}$$

• Set $c(u, v) = 1$, $\forall (u, v) \in E'$.

Figure 2.27. Maximum Bipartite Matching Using Max Flow.

This procedure is graphically shown in Figure 2.27. By solving the max
flow for this flow graph, the maximum matching is obtained, since the
maximum number of edges need to be used between $V_L$ and $V_R$ for the flow
to become maximum.
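The reduction can be sketched end to end in Python. The sketch builds the unit-capacity flow graph described above and then augments one unit of flow at a time, using a depth-first search as the path finder. The sentinel labels for s and t, the function name, and the input format are assumptions for illustration; the recursive search is meant for small graphs.

```python
def maximum_bipartite_matching(left, right, edges):
    """Maximum matching via a unit-capacity flow network.

    left, right: disjoint collections of nodes; edges: list of (l, r) pairs.
    The source and sink labels below are hypothetical sentinels assumed
    not to clash with any node name.
    """
    source, sink = ("source",), ("sink",)
    capacity = {(l, r): 1 for l, r in edges}
    capacity.update({(source, l): 1 for l in left})
    capacity.update({(r, sink): 1 for r in right})
    residual = dict(capacity)
    adjacency = {}
    for u, v in capacity:
        residual.setdefault((v, u), 0)
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    def augment(u, visited):
        """DFS for a residual path from u to the sink; push one unit."""
        if u == sink:
            return True
        visited.add(u)
        for v in adjacency.get(u, []):
            if v not in visited and residual[(u, v)] > 0 and augment(v, visited):
                residual[(u, v)] -= 1
                residual[(v, u)] += 1
                return True
        return False

    while augment(source, set()):
        pass
    # An original edge carries one unit of flow exactly when its
    # residual capacity has dropped to zero.
    return [(l, r) for l, r in edges if residual[(l, r)] == 0]
```

In a small instance with products p1, p2, p3 and users u1, u2, where p1 interests both users and p2 interests only u1, greedily matching p1 with u1 would block p2; the augmenting step reroutes p1 to u2 so that two products are matched.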


2.6.6 Bridge Detection 

As discussed in Section 2.5.7, bridges or cut edges are edges whose 
removal makes formerly connected components disconnected. Here we list 
a simple algorithm for detecting bridges. This algorithm is computationally 
expensive, but quite intuitive. More efficient algorithms have been described 
for the same task. 

Since we know that, by removing bridges, formerly connected components
become disconnected, one simple algorithm is to remove edges one
by one and test if the connected components become disconnected. This
algorithm is outlined in Algorithm 2.7.


Algorithm 2.7 Bridge Detection Algorithm

Require: Connected graph G(V, E)
 1: return Bridge Edges
 2: bridgeSet = {}
 3: for e(u, v) ∈ E do
 4:     G' = Remove e from G
 5:     Disconnected = False;
 6:     if BFS in G' starting at u doesn’t visit v then
 7:         Disconnected = True;
 8:     end if
 9:     if Disconnected then
10:         bridgeSet = bridgeSet ∪ {e}
11:     end if
12: end for
13: Return bridgeSet


The disconnectedness of a component whose edge e(u, v ) is removed 
can be analyzed by means of any graph traversal algorithm (e.g., BFS or 
DFS). Starting at node u , we traverse the graph using BFS and, if node v 
cannot be visited (Line 6), the component has been disconnected and edge 
e is a bridge (Line 10). 
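Algorithm 2.7 can be sketched in Python as follows. The edge-list input and the helper name reachable are assumptions for illustration, and the graph is assumed to be simple and undirected. Instead of copying the graph for each removal, the search skips the removed edge.

```python
from collections import deque


def reachable(adjacency, start, target, removed_edge):
    """BFS from start that ignores removed_edge in both directions."""
    banned = {removed_edge, removed_edge[::-1]}
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adjacency[node]:
            if (node, nxt) in banned or nxt in seen:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return False


def find_bridges(edges):
    """Return the edges whose removal disconnects their endpoints.

    edges: list of (u, v) pairs of a simple undirected graph (assumed).
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return [(u, v) for u, v in edges
            if not reachable(adjacency, u, v, (u, v))]
```

Each edge triggers one traversal, so the running time grows with the number of edges times the cost of one BFS, which is the computational expense the text mentions.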


2.7 Summary 

This chapter covered the fundamentals of graphs, starting with a presentation
of the fundamental building blocks required for graphs: first nodes
and edges, and then properties of graphs such as degree and degree
distribution. Any graph must be represented using some data structure for
computability. This chapter covered three well-established techniques:
adjacency matrix, adjacency list, and edge list. Due to the sparsity of social
networks, both adjacency list and edge list are more efficient and save
significant space when compared to adjacency matrix. We then described various
types of graphs: null and empty graphs, directed/undirected/mixed graphs,
simple/multigraphs, and weighted graphs. Signed graphs are examples of
weighted graphs that can be used to represent contradictory behavior.

We discussed connectivity in graphs and concepts such as paths, walks,
trails, tours, and cycles. Components are connected subgraphs. We discussed
strongly and weakly connected components. Given the connectivity
of a graph, one is able to compute the shortest paths between different nodes.
The longest shortest path in the graph is known as the diameter. Special
graphs can be formed based on the way nodes are connected and the degree
distributions. In complete graphs, all nodes are connected to all other nodes,
and in regular graphs, all nodes have an equal degree. A tree is a graph with
no cycle. We discussed two special trees: the spanning tree and the Steiner
tree. Bipartite graphs can be partitioned into two sets of nodes, with edges
between these sets and no edges inside these sets. Affiliation networks are
examples of bipartite graphs. Bridges are single-point-of-failure edges that
can make previously connected graphs disconnected.

In the section on graph algorithms, we covered a variety of useful techniques.
Traversal algorithms provide an ordering of the nodes of a graph.
These algorithms are particularly useful in checking whether a graph is
connected or in generating paths. Shortest path algorithms find paths with
the shortest length between a pair of nodes; Dijkstra’s algorithm is an example.
Spanning tree algorithms provide subgraphs that span all the nodes and
select edges that sum up to a minimum value; Prim’s algorithm is an example.
The Ford-Fulkerson algorithm is one of the maximum flow algorithms.
It finds the maximum flow in a weighted capacity graph. Maximum bipartite
matching is an application of maximum flow that solves a bipartite matching
problem. Finally, we provided a simple solution for bridge detection.

2.8 Bibliographic Notes 

The algorithms detailed in this chapter are from three well-known fields:
graph theory, network science, and social network analysis. Interested readers
can get better insight regarding the topics in this chapter by referring
to general references in graph theory [Bondy and Murty, 1976; West et al.,
2010; Diestel, 2005], algorithms and algorithm design [Kleinberg and Tardos,
2005; Cormen, 2009], network science [Newman, 2010], and social
network analysis [Wasserman and Faust, 1994].

Other algorithms not discussed in this chapter include graph coloring
[Jensen and Toft, 1994], (quasi) clique detection [Abello, Resende, and
Sudarsky, 2002], graph isomorphism [McKay, 1980], topological sort algorithms
[Cormen, 2009], and the traveling salesman problem (TSP) [Cormen,
2009], among others. In graph coloring, one aims to color elements of the
graph such as nodes and edges such that certain constraints are satisfied.
For instance, in node coloring the goal is to color nodes such that adjacent
nodes have different colors. Cliques are complete subgraphs. Unfortunately,
solving many problems related to cliques, such as finding a clique that has
more than a given number of nodes, is NP-complete. In clique detection, the
goal is to solve similar clique problems efficiently or provide approximate
solutions. In graph isomorphism, given two graphs g and g', our goal is
to find a mapping f from nodes of g to g' such that for any two nodes of
g that are connected, their mapped nodes in g' are connected as well. In
topological sort algorithms, a linear ordering of nodes is found in a directed
graph such that for any directed edge (u, v) in the graph, node u comes
before node v in the ordering. In the traveling salesman problem (TSP), we
are provided cities and pairwise distances between them. In graph theory
terms, we are given a weighted graph where nodes represent cities and
edge weights represent distances between cities. The problem is to find the
shortest walk that visits all cities and returns to the origin city.

Other noteworthy shortest path algorithms such as the A* [Hart, Nilsson,
and Raphael, 2003], the Bellman-Ford [Bellman and Ford, 1956], and all-pair
shortest path algorithms such as Floyd-Warshall’s [Floyd, 1962] are
employed extensively in other literature.

In spanning tree computation, the Kruskal Algorithm [Kruskal, 1956] 
or Boruvka [Motwani and Raghavan, 1995] are also well-established 
algorithms. 

General references for flow algorithms, other algorithms not discussed 
in this chapter such as the Push-Relabel algorithm, and their optimality can 
be found in [Cormen, 2009; Ahuja et al., 1993]. 

2.9 Exercises 
Graph Basics 

1. Given a directed graph G = (V, E) and its adjacency matrix A, we
propose two methods to make G undirected,

$$A'_{ij} = \min(1, A_{ij} + A_{ji}), \tag{2.26}$$

$$A'_{ij} = A_{ij} \times A_{ji}, \tag{2.27}$$

where $A'_{ij}$ is the (i, j) entry of the undirected adjacency matrix. Discuss
the advantages and disadvantages of each method.

Graph Representation 

2. Is it possible to have the following degrees in a graph with 7 nodes?

$$\{4, 4, 4, 3, 5, 7, 2\}. \tag{2.28}$$
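3. Given the following adjacency matrix, compute the adjacency list and
the edge list.

$$A = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix} \tag{2.29}$$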


Special Graphs 


4. Prove that $|E| = |V| - 1$ in trees.


Graph/Network Algorithms 

5. Consider the tree shown in Figure 2.28. Traverse the graph using both 
BFS and DFS and list the order in which nodes are visited in each 
algorithm. 



Figure 2.28. A Sample (Binary) Tree. 


6. For a tree and a node v, under what condition is v visited sooner by 
BFS than DFS? Provide details. 

7. For a real-world social network, is BFS or DFS more desirable? Provide 
details. 

8. Compute the shortest path between any pair of nodes using Dijkstra’s 
algorithm for the graph in Figure 2.29. 

9. Detail why edges with negative weights are not desirable for computing 
shortest paths using Dijkstra’s algorithm. 



Figure 2.29. Weighted Graph. 

10. Compute the minimal spanning tree using Prim’s algorithm in the graph 
provided in Figure 2.30. 



11. Compute the maximum flow in Figure 2.31. 



12. Given a flow network, you are allowed to change one edge’s capacity.
Can this increase the flow? How can we find the correct edge to change?

13. How many bridges are in a bipartite graph? Provide details.
Network Measures


In February 2012, Kobe Bryant, the American basketball star, joined
Chinese microblogging site Sina Weibo. Within a few hours, more than
100,000 followers joined his page, anxiously waiting for his first microblogging
post on the site. The media considered the tremendous number of
followers Kobe Bryant received as an indication of his popularity in China.
In this case, the number of followers measured Bryant’s popularity among
Chinese social media users. In social media, we often face similar tasks in
which measuring different structural properties of a social media network
can help us better understand individuals embedded in it. Corresponding
measures need to be designed for these tasks. This chapter discusses measures
for social media networks.

When mining social media, a graph representation is often used. This 
graph shows friendships or user interactions in a social media network. 
Given this graph, some of the questions we aim to answer are as follows: 

• Who are the central figures (influential individuals) in the network? 

• What interaction patterns are common in friends? 

• Who are the like-minded users and how can we find these similar 
individuals? 

To answer these and similar questions, one first needs to define measures 
for quantifying centrality, level of interactions, and similarity, among other 
qualities. These measures take as input a graph representation of a social 
interaction, such as friendships (adjacency matrix), from which the measure 
value is computed. 

To answer our first question about finding central figures, we define 
measures for centrality. By using these measures, we can identify various 
types of central nodes in a network. To answer the other two questions, we 
define corresponding measures that can quantify interaction patterns and 
help find like-minded users. We discuss centrality next. 
3.1 Centrality

Centrality defines how important a node is within a network.


3.1.1 Degree Centrality 


In real-world interactions, we often consider people with many connections
to be important. Degree centrality transfers the same idea into a measure.
The degree centrality measure ranks nodes with more connections higher
in terms of centrality. The degree centrality $C_d$ for node $v_i$ in an undirected
graph is

$$C_d(v_i) = d_i, \tag{3.1}$$

where $d_i$ is the degree (number of adjacent edges) of node $v_i$. In directed
graphs, we can either use the in-degree, the out-degree, or the combination
as the degree centrality value:

$$C_d(v_i) = d_i^{\text{in}} \quad \text{(prestige)}, \tag{3.2}$$

$$C_d(v_i) = d_i^{\text{out}} \quad \text{(gregariousness)}, \tag{3.3}$$

$$C_d(v_i) = d_i^{\text{in}} + d_i^{\text{out}}. \tag{3.4}$$


When using in-degrees, degree centrality measures how popular a node 
is and its value shows prominence or prestige. When using out-degrees, it 
measures the gregariousness of a node. When we combine in-degrees and 
out-degrees, we are basically ignoring edge directions. In fact, when edge 
directions are removed, Equation 3.4 is equivalent to Equation 3.1, which 
measures degree centrality for undirected graphs. 

The degree centrality measure does not allow for centrality values to be 
compared across networks (e.g., Facebook and Twitter). To overcome this 
problem, we can normalize the degree centrality values. 


Normalizing Degree Centrality 


Simple normalization methods include normalizing by the maximum
possible degree,

$$C_d^{\text{norm}}(v_i) = \frac{d_i}{n - 1}, \tag{3.5}$$

where n is the number of nodes. We can also normalize by the maximum
degree,

$$C_d^{\max}(v_i) = \frac{d_i}{\max_j d_j}. \tag{3.6}$$

Finally, we can normalize by the degree sum,

$$C_d^{\text{sum}}(v_i) = \frac{d_i}{\sum_j d_j} = \frac{d_i}{2|E|} = \frac{d_i}{2m}. \tag{3.7}$$

Figure 3.1. Degree Centrality Example.


Example 3.1. Figure 3.1 shows a sample graph. In this graph, degree
centrality for node $v_1$ is $C_d(v_1) = d_1 = 8$, and for all others, it is
$C_d(v_j) = d_j = 1$, $j \ne 1$.
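Degree centrality and its three normalizations can be computed in a few lines. The sketch below assumes an undirected graph given as a dictionary of neighbor sets, with at least two nodes and at least one edge; the dictionary keys of the result are illustrative names, not notation from the text.

```python
def degree_centralities(adjacency):
    """Degree centrality with three normalizations.

    adjacency: dict mapping node -> set of neighbors (undirected; assumed
    to contain at least two nodes and at least one edge).
    """
    n = len(adjacency)
    degree = {v: len(neighbors) for v, neighbors in adjacency.items()}
    max_degree = max(degree.values())
    degree_sum = sum(degree.values())        # equals 2|E|
    return {
        v: {
            "degree": d,
            "by_max_possible": d / (n - 1),  # divide by n - 1
            "by_max": d / max_degree,        # divide by max_j d_j
            "by_sum": d / degree_sum,        # divide by sum_j d_j
        }
        for v, d in degree.items()
    }
```

For a star graph like the one in Example 3.1, with $v_1$ linked to eight other nodes, $v_1$ gets degree 8, a value of 1 under the first two normalizations, and 8/16 = 0.5 under the degree-sum normalization.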


3.1.2 Eigenvector Centrality 

In degree centrality, we consider nodes with more connections to be more 
important. However, in real-world scenarios, having more friends does 
not by itself guarantee that someone is important: having more important 
friends provides a stronger signal. 

Eigenvector centrality tries to generalize degree centrality by incorporating
the importance of the neighbors (or incoming neighbors in directed
graphs). It is defined for both directed and undirected graphs. To keep track
of neighbors, we can use the adjacency matrix A of a graph. Let $C_e(v_i)$
denote the eigenvector centrality of node $v_i$. We want the centrality of $v_i$ to
be a function of its neighbors’ centralities. We posit that it is proportional
to the summation of their centralities,

$$C_e(v_i) = \frac{1}{\lambda} \sum_{j=1}^{n} A_{j,i}\, C_e(v_j), \tag{3.8}$$

where $\lambda$ is some fixed constant. Assuming $\mathbf{C}_e = (C_e(v_1), C_e(v_2), \ldots, C_e(v_n))^T$
is the centrality vector for all nodes, we can rewrite Equation
3.8 as

$$\lambda \mathbf{C}_e = A^T \mathbf{C}_e. \tag{3.9}$$

This basically means that $\mathbf{C}_e$ is an eigenvector of adjacency matrix $A^T$
(or A in undirected networks, since $A = A^T$) and $\lambda$ is the corresponding
eigenvalue. A matrix can have many eigenvalues and, in turn, many corresponding
eigenvectors. Hence, this raises the question: which eigenvalue-eigenvector
pair should we select? We often prefer centrality values to be
positive for convenient comparison of centrality values across nodes. Thus,
we can choose an eigenvalue such that the eigenvector components are
positive. This brings us to the Perron-Frobenius theorem.


Theorem 3.1 (Perron-Frobenius Theorem). Let $A \in \mathbb{R}^{n \times n}$ represent the
adjacency matrix for a [strongly] connected graph or $A : A_{i,j} > 0$ (i.e.,
a positive n by n matrix). There exists a positive real number (Perron-Frobenius
eigenvalue) $\lambda_{\max}$, such that $\lambda_{\max}$ is an eigenvalue of A and any
other eigenvalue of A is strictly smaller than $\lambda_{\max}$. Furthermore, there exists
a corresponding eigenvector $\mathbf{v} = (v_1, v_2, \ldots, v_n)$ of A with eigenvalue $\lambda_{\max}$
such that $\forall v_i > 0$.

Therefore, to have positive centrality values, we can compute the eigenvalues
of A and then select the largest eigenvalue. The corresponding
eigenvector is $\mathbf{C}_e$. Based on the Perron-Frobenius theorem, all the components
of $\mathbf{C}_e$ will be positive, and this vector corresponds to eigenvector
centralities for the graph.

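Example 3.2. For the graph shown in Figure 3.2(a), the adjacency matrix
is

$$A = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}. \tag{3.10}$$

Based on Equation 3.9, we need to solve $\lambda \mathbf{C}_e = A \mathbf{C}_e$, or

$$(A - \lambda I)\mathbf{C}_e = 0. \tag{3.11}$$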
Figure 3.2. Eigenvector Centrality Example: (a) a three node graph; (b) a five node graph.


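Assuming $\mathbf{C}_e = [u_1\ u_2\ u_3]^T$,

$$\begin{bmatrix} 0-\lambda & 1 & 0 \\ 1 & 0-\lambda & 1 \\ 0 & 1 & 0-\lambda \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}. \tag{3.12}$$

Since $\mathbf{C}_e \ne [0\ 0\ 0]^T$, the characteristic equation is

$$\det(A - \lambda I) = \begin{vmatrix} 0-\lambda & 1 & 0 \\ 1 & 0-\lambda & 1 \\ 0 & 1 & 0-\lambda \end{vmatrix} = 0, \tag{3.13}$$

or equivalently,

$$(-\lambda)(\lambda^2 - 1) - 1(-\lambda) = 2\lambda - \lambda^3 = \lambda(2 - \lambda^2) = 0. \tag{3.14}$$

So the eigenvalues are $(-\sqrt{2}, 0, +\sqrt{2})$. We select the largest eigenvalue:
$\sqrt{2}$. We compute the corresponding eigenvector:

$$\begin{bmatrix} 0-\sqrt{2} & 1 & 0 \\ 1 & 0-\sqrt{2} & 1 \\ 0 & 1 & 0-\sqrt{2} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}. \tag{3.15}$$

Assuming the $\mathbf{C}_e$ vector has norm 1, its solution is

$$\mathbf{C}_e = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix} = \begin{bmatrix} 1/2 \\ \sqrt{2}/2 \\ 1/2 \end{bmatrix}, \tag{3.16}$$

which denotes that node $v_2$ is the most central node and nodes $v_1$ and $v_3$
have equal centrality values.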
Example 3.3. For the graph shown in Figure 3.2(b), the adjacency matrix
is as follows:

$$A = \begin{bmatrix}
0 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0
\end{bmatrix}. \tag{3.17}$$

The eigenvalues of A are (−1.74, −1.27, 0.00, +0.33, +2.68). For
eigenvector centrality, the largest eigenvalue is selected: 2.68. The corresponding
eigenvector is the eigenvector centrality vector and is

$$\mathbf{C}_e = \begin{bmatrix} 0.4119 \\ 0.5825 \\ 0.4119 \\ 0.5237 \\ 0.2169 \end{bmatrix}. \tag{3.18}$$

Based on eigenvector centrality, node $v_2$ is the most central node.
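The eigenvector for the largest eigenvalue can be approximated without a full eigendecomposition. The sketch below uses power iteration, which is a swapped-in numerical technique and not the characteristic-equation method used in the examples. It iterates with $A^T + I$ instead of $A^T$: both matrices share eigenvectors, and the shift is an assumption added here so that the iteration also converges on graphs such as the path in Example 3.2, where $-\sqrt{2}$ has the same magnitude as $\sqrt{2}$. The matrix is assumed to be a list of lists for a connected (or strongly connected) graph.

```python
import math


def eigenvector_centrality(A, iterations=1000, tol=1e-12):
    """Approximate the unit-norm eigenvector of A^T for its largest eigenvalue.

    A: adjacency matrix as a list of lists (assumed connected graph).
    Uses power iteration on A^T + I, which has the same eigenvectors.
    """
    n = len(A)
    c = [1.0] * n
    for _ in range(iterations):
        # nxt = (A^T + I) c, so entry i sums A[j][i] * c[j] over j, plus c[i].
        nxt = [c[i] + sum(A[j][i] * c[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        nxt = [x / norm for x in nxt]
        if max(abs(a - b) for a, b in zip(nxt, c)) < tol:
            return nxt
        c = nxt
    return c
```

Applied to the matrix of Example 3.2, the iteration approaches (1/2, √2/2, 1/2), in agreement with Equation 3.16.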


3.1.3 Katz Centrality 

A major problem with eigenvector centrality arises when it considers
directed graphs (see Problem 1 in the Exercises). Centrality is only passed
on when we have (outgoing) edges, and in special cases such as when a
node is in a directed acyclic graph, centrality becomes zero, even though
the node can have many edges connected to it. In this case, the problem can
be rectified by adding a bias term to the centrality value. The bias term $\beta$ is
added to the centrality values for all nodes no matter how they are situated
in the network (i.e., irrespective of the network topology). The resulting
centrality measure is called the Katz centrality and is formulated as

$$C_{\text{Katz}}(v_i) = \alpha \sum_{j=1}^{n} A_{j,i}\, C_{\text{Katz}}(v_j) + \beta. \tag{3.19}$$

The first term is similar to eigenvector centrality, and its effect is
controlled by constant $\alpha$. The second term, $\beta$, is the bias term that avoids
zero centrality values. We can rewrite Equation 3.19 in a vector form,

$$\mathbf{C}_{\text{Katz}} = \alpha A^T \mathbf{C}_{\text{Katz}} + \beta \mathbf{1}, \tag{3.20}$$
Figure 3.3. Katz Centrality Example.


where $\mathbf{1}$ is a vector of all 1’s. Taking the first term to the left hand side and
factoring $\mathbf{C}_{\text{Katz}}$,

$$\mathbf{C}_{\text{Katz}} = \beta (I - \alpha A^T)^{-1} \cdot \mathbf{1}. \tag{3.21}$$

Since we are inverting a matrix here, not all $\alpha$ values are acceptable.
When $\alpha = 0$, the eigenvector centrality part is removed, and all nodes get
the same centrality value $\beta$. However, once $\alpha$ gets larger, the effect of
$\beta$ is reduced, and when $\det(I - \alpha A^T) = 0$, the matrix $I - \alpha A^T$ becomes
non-invertible and the centrality values diverge. The $\det(I - \alpha A^T)$ first
becomes 0 when $\alpha = 1/\lambda$, where $\lambda$ is the largest eigenvalue of $A^T$. In
practice, $\alpha < 1/\lambda$ is selected so that centralities are computed correctly.

Example 3.4. For the graph shown in Figure 3.3, the adjacency matrix is
as follows:

$$A = \begin{bmatrix}
0 & 1 & 1 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 0
\end{bmatrix} = A^T. \tag{3.22}$$

The eigenvalues of A are (−1.68, −1.0, −1.0, +0.35, +3.32). The
largest eigenvalue of A is $\lambda = 3.32$. We assume $\alpha = 0.25 < 1/\lambda$ and
$\beta = 0.2$. Then, Katz centralities are

$$\mathbf{C}_{\text{Katz}} = \beta (I - \alpha A^T)^{-1} \cdot \mathbf{1} = \begin{bmatrix} 1.14 \\ 1.31 \\ 1.31 \\ 1.14 \\ 0.85 \end{bmatrix}. \tag{3.23}$$

Thus, nodes $v_2$ and $v_3$ have the highest Katz centralities.
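Katz centrality can also be obtained without inverting a matrix, by repeatedly applying Equation 3.20 until the values stop changing. This fixed-point iteration is a swapped-in technique, not the closed form of Equation 3.21, and it converges only under the condition stated in the text, $\alpha < 1/\lambda$. The list-of-lists matrix format and the parameter names are assumptions for illustration.

```python
def katz_centrality(A, alpha, beta, iterations=1000, tol=1e-12):
    """Iterate C = alpha * A^T C + beta * 1 until it stabilizes.

    A: adjacency matrix as a list of lists (assumed format).
    Converges when alpha is below 1 / (largest eigenvalue of A^T).
    """
    n = len(A)
    c = [beta] * n
    for _ in range(iterations):
        nxt = [alpha * sum(A[j][i] * c[j] for j in range(n)) + beta
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, c)) < tol:
            return nxt
        c = nxt
    return c
```

With the matrix of Example 3.4, $\alpha = 0.25$, and $\beta = 0.2$, the iteration settles at approximately (1.143, 1.314, 1.314, 1.143, 0.857), in line with Equation 3.23 up to rounding.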
Figure 3.4. PageRank Example.


3.1.4 PageRank 

Similar to eigenvector centrality, Katz centrality encounters some challenges.
A challenge that happens in directed graphs is that, once a node
becomes an authority (high centrality), it passes all its centrality along all
of its out-links. This is less desirable, because not everyone known by a
well-known person is well known. To mitigate this problem, one can divide
the value of passed centrality by the number of outgoing links (out-degree)
from that node such that each connected neighbor gets a fraction of the
source node’s centrality:

$$C_p(v_i) = \alpha \sum_{j=1}^{n} A_{j,i} \frac{C_p(v_j)}{d_j^{\text{out}}} + \beta. \tag{3.24}$$

This equation is only defined when $d_j^{\text{out}}$ is nonzero. Thus, assuming
that all nodes have positive out-degrees ($d_j^{\text{out}} > 0$), Equation 3.24 can be
reformulated in matrix format,

$$\mathbf{C}_p = \alpha A^T D^{-1} \mathbf{C}_p + \beta \mathbf{1}, \tag{3.25}$$

which we can reorganize,

$$\mathbf{C}_p = \beta (I - \alpha A^T D^{-1})^{-1} \cdot \mathbf{1}, \tag{3.26}$$

where $D = \operatorname{diag}(d_1^{\text{out}}, d_2^{\text{out}}, \ldots, d_n^{\text{out}})$ is a diagonal matrix of degrees. The
centrality measure is known as the PageRank centrality measure and is
used by the Google search engine as a measure for ordering webpages.
Webpages and their links represent an enormous web-graph. PageRank
defines a centrality measure for the nodes (webpages) in this web-graph.
When a user queries Google, webpages that match the query and have higher
PageRank values are shown first. Similar to Katz centrality, in practice,
$\alpha < 1/\lambda$ is selected, where $\lambda$ is the largest eigenvalue of $A^T D^{-1}$. In undirected
graphs, the largest eigenvalue of $A^T D^{-1}$ is $\lambda = 1$; therefore, $\alpha < 1$.

Example 3.5. For the graph shown in Figure 3.4, the adjacency matrix is 
as follows, 


$$A = \begin{bmatrix} 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \end{bmatrix}. \tag{3.27}$$


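We assume $\alpha = 0.95 < 1$ and $\beta = 0.1$. Then, PageRank values are

$$\mathbf{C}_p = \beta (I - \alpha A^{T} D^{-1})^{-1} \cdot \mathbf{1} = \begin{bmatrix} 2.14 \\ 2.13 \\ 2.14 \\ 1.45 \\ 2.13 \end{bmatrix}. \tag{3.28}$$

Hence, nodes $v_1$ and $v_3$ have the highest PageRank values.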
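Example 3.5 can be checked the same way. This is a hedged sketch under stated assumptions: `pagerank_centrality` is a hypothetical helper name, NumPy is assumed, out-degrees are taken as row sums of $A$ (so $A_{j,i} = 1$ means an edge from $v_j$ to $v_i$), and the matrix is the one in Equation 3.27.

```python
import numpy as np

def pagerank_centrality(A, alpha, beta):
    """Solve C = alpha * A^T D^{-1} C + beta * 1 (Equation 3.26)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)  # row sums: number of out-links per node
    if np.any(out_degree == 0):
        raise ValueError("every node needs a positive out-degree")
    D_inv = np.diag(1.0 / out_degree)
    return np.linalg.solve(np.eye(n) - alpha * A.T @ D_inv, beta * np.ones(n))

# Adjacency matrix of Example 3.5 (Equation 3.27).
A_pr = np.array([
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
])

pagerank_scores = pagerank_centrality(A_pr, alpha=0.95, beta=0.1)
print(np.round(pagerank_scores, 2))
```

The scores should match Equation 3.28 up to rounding in the last digit. Dividing by the out-degree is what distinguishes this from the Katz sketch: each node splits its centrality among its out-links instead of passing the full value to every neighbor.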


3.1.5 Betweenness Centrality 

Another way of looking at centrality is by considering how important nodes
are in connecting other nodes. One approach, for a node $v_i$, is to compute
the number of shortest paths between other nodes that pass through $v_i$,

$$C_b(v_i) = \sum_{s \neq t \neq v_i} \frac{\sigma_{st}(v_i)}{\sigma_{st}}, \tag{3.29}$$

where $\sigma_{st}$ is the number of shortest paths from node $s$ to $t$ (also known
as information pathways), and $\sigma_{st}(v_i)$ is the number of shortest paths from
$s$ to $t$ that pass through $v_i$. In other words, we are measuring how central
$v_i$'s role is in connecting any pair of nodes $s$ and $t$. This measure is called
betweenness centrality.

Betweenness centrality needs to be normalized to be comparable across
networks. To normalize betweenness centrality, one needs to compute the
maximum value it takes. Betweenness centrality takes its maximum value
when node $v_i$ is on all shortest paths from $s$ to $t$ for any pair $(s, t)$; that
is, $\forall (s, t),\ s \neq t \neq v_i,\ \frac{\sigma_{st}(v_i)}{\sigma_{st}} = 1$. For instance, in Figure 3.1, node $v_1$ is
on the shortest path between all other pairs of nodes. Thus, the maximum
value is

$$C_b(v_i) = \sum_{s \neq t \neq v_i} \frac{\sigma_{st}(v_i)}{\sigma_{st}} = \sum_{s \neq t \neq v_i} 1 = 2\binom{n-1}{2} = (n-1)(n-2). \tag{3.30}$$

The betweenness can be divided by its maximum value to obtain the
normalized betweenness,

$$C_b^{\text{norm}}(v_i) = \frac{C_b(v_i)}{2\binom{n-1}{2}}. \tag{3.31}$$


Computing Betweenness 

In betweenness centrality (Equation 3.29), we compute shortest paths
between all pairs of nodes to compute the betweenness value. If an
algorithm such as Dijkstra's is employed, it needs to be run for all nodes, because
Dijkstra's algorithm will compute shortest paths from a single node to all
other nodes. So, to compute all-pairs shortest paths, Dijkstra's algorithm
needs to be run $|V| - 1$ times (with the exception of the node for which
centrality is being computed). More efficient algorithms, such as
Brandes' algorithm [Brandes, 2001], have been designed. Interested readers can
refer to the bibliographic notes for further references.

Example 3.6. For Figure 3.1, the (normalized) betweenness centrality of
node $v_1$ is

$$C_b(v_1) = 2\binom{n-1}{2}, \tag{3.32}$$

$$C_b^{\text{norm}}(v_1) = 1, \tag{3.33}$$

where $n$ is the number of nodes in the graph. Since all the paths that go
through any pair $(s, t)$, $s \neq t \neq v_1$, pass through node $v_1$, the centrality
is $2\binom{n-1}{2}$. Similarly, the betweenness centrality
for any other node in this graph is 0.


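Example 3.7. Figure 3.5 depicts a sample graph. In this graph, the betweenness
centrality for node $v_1$ is 0, since no shortest path passes through it.
For other nodes, we have

$$C_b(v_2) = 2 \times \Big( \underbrace{(1/1)}_{s=v_1,\,t=v_3} + \underbrace{(1/1)}_{s=v_1,\,t=v_4} + \underbrace{(2/2)}_{s=v_1,\,t=v_5} + \underbrace{(1/2)}_{s=v_3,\,t=v_4} + \underbrace{0}_{s=v_3,\,t=v_5} + \underbrace{0}_{s=v_4,\,t=v_5} \Big) = 2 \times 3.5 = 7, \tag{3.34}$$

$$C_b(v_3) = 2 \times \Big( \underbrace{0}_{s=v_1,\,t=v_2} + \underbrace{0}_{s=v_1,\,t=v_4} + \underbrace{(1/2)}_{s=v_1,\,t=v_5} + \underbrace{0}_{s=v_2,\,t=v_4} + \underbrace{(1/2)}_{s=v_2,\,t=v_5} + \underbrace{0}_{s=v_4,\,t=v_5} \Big) = 2 \times 1.0 = 2, \tag{3.35}$$

$$C_b(v_4) = C_b(v_3) = 2 \times 1.0 = 2, \tag{3.36}$$

$$C_b(v_5) = 2 \times \Big( \underbrace{0}_{s=v_1,\,t=v_2} + \underbrace{0}_{s=v_1,\,t=v_3} + \underbrace{0}_{s=v_1,\,t=v_4} + \underbrace{0}_{s=v_2,\,t=v_3} + \underbrace{0}_{s=v_2,\,t=v_4} + \underbrace{(1/2)}_{s=v_3,\,t=v_4} \Big) = 2 \times 0.5 = 1, \tag{3.37}$$

where centralities are multiplied by 2 because in an undirected graph

$$\sum_{s \neq t \neq v_i} \frac{\sigma_{st}(v_i)}{\sigma_{st}} = 2 \sum_{s \neq t \neq v_i,\ s < t} \frac{\sigma_{st}(v_i)}{\sigma_{st}}.$$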
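The counting in Example 3.7 can be reproduced with breadth-first search. The sketch below is a straightforward all-pairs approach for unweighted graphs, not Brandes' algorithm. The helper names are hypothetical, and the edge list for Figure 3.5 is an assumption inferred from the path counts and distances quoted in the examples ($v_1$-$v_2$, $v_2$-$v_3$, $v_2$-$v_4$, $v_3$-$v_5$, $v_4$-$v_5$), since the figure itself is not reproduced here.

```python
from collections import deque

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """Unnormalized betweenness (Equation 3.29) over ordered pairs (s, t)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    scores = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue
            for v in adj:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    scores[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return scores

# Assumed edges of Figure 3.5 (inferred from Examples 3.7 and 3.8).
fig35 = {1: [2], 2: [1, 3, 4], 3: [2, 5], 4: [2, 5], 5: [3, 4]}
bc = betweenness(fig35)
n = len(fig35)
bc_norm = {v: c / ((n - 1) * (n - 2)) for v, c in bc.items()}
print(bc)
```

Because ordered pairs are summed, the factor of 2 in Equations 3.34 to 3.37 arises automatically. The normalized values divide by $(n-1)(n-2)$, as in Equation 3.31.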


3.1.6 Closeness Centrality 

In closeness centrality, the intuition is that the more central nodes are, the
more quickly they can reach other nodes. Formally, these nodes should have
a smaller average shortest path length to other nodes. Closeness centrality
is defined as

$$C_c(v_i) = \frac{1}{\bar{l}_{v_i}}, \tag{3.38}$$

where $\bar{l}_{v_i} = \frac{1}{n-1} \sum_{v_j \neq v_i} l_{i,j}$ is node $v_i$'s average shortest path length to other
nodes. The smaller the average shortest path length, the higher the centrality
for the node.

Figure 3.6. Example for All Centrality Measures.

Example 3.8. For nodes in Figure 3.5, the closeness centralities are as
follows:

$$C_c(v_1) = 1 / ((1 + 2 + 2 + 3)/4) = 0.5, \tag{3.39}$$

$$C_c(v_2) = 1 / ((1 + 1 + 1 + 2)/4) = 0.8, \tag{3.40}$$

$$C_c(v_3) = C_c(v_4) = 1 / ((1 + 1 + 2 + 2)/4) = 0.66, \tag{3.41}$$

$$C_c(v_5) = 1 / ((1 + 1 + 2 + 3)/4) = 0.57. \tag{3.42}$$

Hence, node $v_2$ has the highest closeness centrality.
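The closeness values of Example 3.8 follow from hop distances alone. The following is a minimal sketch: the helper names are hypothetical, the graph is assumed to be connected and unweighted, and the edge list for Figure 3.5 is inferred from the distances used in Equations 3.39 to 3.42.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search distances (in hops) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness(adj, v):
    """Inverse of the average shortest path length from v (Equation 3.38)."""
    dist = hop_distances(adj, v)
    others = [dist[u] for u in adj if u != v]  # assumes a connected graph
    return 1.0 / (sum(others) / len(others))

# Assumed edges of Figure 3.5 (inferred from Example 3.8).
fig35_graph = {1: [2], 2: [1, 3, 4], 3: [2, 5], 4: [2, 5], 5: [3, 4]}
closeness_scores = {v: closeness(fig35_graph, v) for v in fig35_graph}
print({v: round(c, 3) for v, c in closeness_scores.items()})
```

Node $v_2$ obtains the largest value, in line with the conclusion of the example.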

The centrality measures discussed thus far have different views on what 
a central node is. Thus, a central node for one measure may be deemed 
unimportant by other measures. 

Example 3.9. Consider the graph in Figure 3.6. For this graph, we compute
the top three central nodes based on degree, eigenvector, Katz, PageRank,
betweenness, and closeness centrality methods. These nodes are listed in
Table 3.1.

As shown in the table, there is a high degree of similarity between
most central nodes for the first four measures, which utilize eigenvectors
or degrees: degree centrality, eigenvector centrality, Katz centrality, and
PageRank. Betweenness centrality also generates similar results to closeness
centrality because both use the shortest paths to find most central
nodes.

Table 3.1. A Comparison between Centrality Methods

                                              First node      Second node     Third node
Degree Centrality                             $v_3$ or $v_6$  $v_6$ or $v_3$  $v \in \{v_4, v_5, v_7, v_8, v_9\}$
Eigenvector Centrality                        $v_6$           $v_3$           $v_4$ or $v_5$
Katz Centrality, $\alpha = \beta = 0.3$       $v_6$           $v_3$           $v_4$ or $v_5$
PageRank, $\alpha = \beta = 0.3$              $v_3$           $v_6$           $v_2$
Betweenness Centrality                        $v_6$           $v_7$           $v_3$
Closeness Centrality                          $v_6$           $v_3$ or $v_7$  $v_7$ or $v_3$


3.1.7 Group Centrality 

All centrality measures defined so far measure centrality for a single node.
These measures can be generalized for a group of nodes. In this section,
we discuss how degree centrality, closeness centrality, and betweenness
centrality can be generalized for a group of nodes. Let $S$ denote the set of
nodes to be measured for centrality. Let $V - S$ denote the set of nodes not
in the group.


Group Degree Centrality

Group degree centrality is defined as the number of nodes from outside the
group that are connected to group members. Formally,

$$C_d^{\text{group}}(S) = |\{v_i \in V - S \mid v_i \text{ is connected to } v_j \in S\}|. \tag{3.43}$$

Similar to degree centrality, we can define connections in terms of
out-degrees or in-degrees in directed graphs. We can also normalize this value.
In the best case, group members are connected to all other nonmembers.
Thus, the maximum value of $C_d^{\text{group}}(S)$ is $|V - S|$. So dividing group degree
centrality value by $|V - S|$ normalizes it.


Group Betweenness Centrality 

Similar to betweenness centrality, we can define group betweenness centrality
as

$$C_b^{\text{group}}(S) = \sum_{s \neq t,\ s \notin S,\ t \notin S} \frac{\sigma_{st}(S)}{\sigma_{st}}, \tag{3.44}$$

where $\sigma_{st}(S)$ denotes the number of shortest paths between $s$ and $t$ that
pass through members of $S$. In the best case, all shortest paths between $s$
and $t$ pass through members of $S$, and therefore, the maximum value for
$C_b^{\text{group}}(S)$ is $2\binom{|V - S|}{2}$. Similar to betweenness centrality, we can normalize
group betweenness centrality by dividing it by the maximum value.

Figure 3.7. Group Centrality Example.


Group Closeness Centrality

Closeness centrality for groups can be defined as

$$C_c^{\text{group}}(S) = \frac{1}{\bar{l}_S^{\text{group}}}, \tag{3.45}$$

where $\bar{l}_S^{\text{group}} = \frac{1}{|V - S|} \sum_{v_j \notin S} l_{S,v_j}$, and $l_{S,v_j}$ is the length of the shortest path
between a group $S$ and a nonmember $v_j \in V - S$. This length can be
defined in multiple ways. One approach is to find the closest member in $S$
to $v_j$:

$$l_{S,v_j} = \min_{v_i \in S} l_{v_i,v_j}. \tag{3.46}$$

One can also use the maximum distance or the average distance to
compute this value.

Example 3.10. Consider the graph in Figure 3.7. Let $S = \{v_2, v_3\}$. Group
degree centrality for $S$ is

$$C_d^{\text{group}}(S) = 3, \tag{3.47}$$

since members of the group are connected to all other three members
in $V - S = \{v_1, v_4, v_5\}$. The normalized value is 1, since $3/|V - S| = 1$.
Group betweenness centrality is 6, since for $2\binom{3}{2}$ shortest paths between
any two members of $V - S$, the path has to pass through members of $S$. The
normalized group betweenness is 1, since $6/\big(2\binom{|V - S|}{2}\big) = 1$. Finally, group
closeness centrality (assuming the distance from nonmembers to members
of $S$ is computed using the minimum function) is also 1, since any member
of $V - S$ is connected to a member of $S$ directly.
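Group degree and group closeness can be sketched in a few lines. Figure 3.7 is not reproduced here, so the graph below is a hypothetical stand-in that is merely consistent with the statements of Example 3.10 (every nonmember is adjacent to a member of $S = \{v_2, v_3\}$). The function names are also illustrative, and a connected, unweighted graph is assumed.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search distances (in hops) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def group_degree(adj, group):
    """Number of nonmembers adjacent to at least one member (Equation 3.43)."""
    group = set(group)
    return sum(1 for v in adj
               if v not in group and any(u in group for u in adj[v]))

def group_closeness(adj, group):
    """Equation 3.45 with the minimum-distance rule of Equation 3.46."""
    group = set(group)
    outsiders = [v for v in adj if v not in group]
    total = 0
    for v in outsiders:
        dist = hop_distances(adj, v)
        total += min(dist[u] for u in group)  # closest member of the group
    return 1.0 / (total / len(outsiders))

# Hypothetical graph standing in for Figure 3.7.
toy = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3], 5: [3]}
S = {2, 3}
gd = group_degree(toy, S)
gd_norm = gd / (len(toy) - len(S))
gc = group_closeness(toy, S)
print(gd, gd_norm, gc)
```

Swapping `min` for `max` or an average in `group_closeness` gives the alternative distance definitions mentioned in the text.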


3.2 Transitivity and Reciprocity 

Often we need to observe a specific behavior in a social media network.
One such behavior is linking behavior. Linking behavior determines how
links (edges) are formed in a social graph. In this section, we discuss two
well-known measures, transitivity and reciprocity, for analyzing this
behavior. Both measures are commonly used in directed networks, and transitivity
can also be applied to undirected networks.

Figure 3.8. Transitive Linking.


3.2.1 Transitivity 

In transitivity, we analyze the linking behavior to determine whether it
demonstrates a transitive behavior. In mathematics, for a transitive relation
$R$, $aRb \wedge bRc \rightarrow aRc$. The transitive linking behavior can be described
as follows.


Transitive Linking

Let $v_1$, $v_2$, $v_3$ denote three nodes. When edges $(v_1, v_2)$ and $(v_2, v_3)$ are
formed, if $(v_3, v_1)$ is also formed, then we have observed a transitive linking
behavior (transitivity). This is shown in Figure 3.8.

In a less formal setting,

Transitivity is when a friend of my friend is my friend.

As shown in the definition, a transitive behavior needs at least three
edges. These three edges, along with the participating nodes, create a
triangle. Higher transitivity in a graph results in a denser graph, which in
turn is closer to a complete graph. Thus, we can determine how close
graphs are to the complete graph by measuring transitivity. This can be
performed by measuring the [global] clustering coefficient and local
clustering coefficient. The former is computed for the network, whereas the
latter is computed for a node.

Clustering Coefficient 

The clustering coefficient analyzes transitivity in an undirected graph. Since
transitivity is observed when triangles are formed, we can measure it by
counting paths of length 2 (edges $(v_1, v_2)$ and $(v_2, v_3)$) and checking whether
the third edge $(v_3, v_1)$ exists (i.e., the path is closed). Thus, clustering
coefficient $C$ is defined as

$$C = \frac{|\text{Closed Paths of Length 2}|}{|\text{Paths of Length 2}|}. \tag{3.48}$$

Alternatively, we can count triangles

$$C = \frac{(\text{Number of Triangles}) \times 6}{|\text{Paths of Length 2}|}. \tag{3.49}$$

Since every triangle has six closed paths of length 2, we can rewrite
Equation 3.49 as

$$C = \frac{(\text{Number of Triangles}) \times 3}{\text{Number of Connected Triples of Nodes}}. \tag{3.50}$$

In this equation, a triple is an ordered set of three nodes, connected by
two (i.e., open triple) or three (closed triple) edges. Two triples are different
when

• their nodes are different, or

• their nodes are the same, but the triples are missing different edges.

For example, triples $v_i v_j v_k$ and $v_j v_k v_i$ are different, since the first triple
is missing edge $e(v_k, v_i)$ and the second triple is missing edge $e(v_i, v_j)$, even
though they have the same members. Following the same argument, triples
$v_i v_j v_k$ and $v_k v_j v_i$ are the same, because both are missing edge $e(v_k, v_i)$ and
have the same members. Since triangles have three edges, one edge can be
missed in each triple; therefore, three different triples can be formed from
one triangle. The number of triangles is therefore multiplied by a factor
of 3 in the numerator of Equation 3.50. Note that the clustering coefficient
is computed for the whole network.


Example 3.11. For the graph in Figure 3.9, the clustering coefficient is

$$C = \frac{(\text{Number of Triangles}) \times 3}{\text{Number of Connected Triples of Nodes}} = \frac{2 \times 3}{2 \times 3 + \underbrace{2}_{v_2 v_1 v_4,\ v_2 v_3 v_4}} = 0.75. \tag{3.51}$$
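Equation 3.48 translates directly into matrix terms: the trace of $A^3$ counts closed paths of length 2 (six per triangle), and $\sum_i d_i (d_i - 1)$ counts all ordered paths of length 2. The sketch below assumes NumPy, uses the hypothetical name `global_clustering`, and assumes an edge list for Figure 3.9 inferred from the example (two triangles $v_1 v_2 v_3$ and $v_1 v_3 v_4$, with edge $(v_2, v_4)$ missing).

```python
import numpy as np

def global_clustering(A):
    """Closed paths of length 2 divided by all paths of length 2 (Eq. 3.48)."""
    A = np.asarray(A, dtype=float)
    closed = np.trace(A @ A @ A)             # 6 closed paths per triangle
    degree = A.sum(axis=1)
    paths2 = np.sum(degree * (degree - 1))   # ordered paths of length 2
    return closed / paths2

# Assumed adjacency matrix for Figure 3.9 (undirected, 4 nodes).
A_fig39 = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
])

C_global = global_clustering(A_fig39)
print(C_global)
```

For this matrix the numerator is 12 and the denominator is 16, which gives the value 0.75 reported in Equation 3.51.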


The clustering coefficient can also be computed locally. The following 
subsection discusses how it can be computed for a single node. 
Figure 3.9. A Global Clustering Coefficient Example.


Local Clustering Coefficient 

The local clustering coefficient measures transitivity at the node level.
Commonly used for undirected graphs, it estimates how strongly neighbors
of a node $v_i$ (nodes adjacent to $v_i$) are themselves connected. The coefficient
is defined as

$$C(v_i) = \frac{\text{Number of Pairs of Neighbors of } v_i \text{ That Are Connected}}{\text{Number of Pairs of Neighbors of } v_i}. \tag{3.52}$$

In an undirected graph, the denominator can be rewritten as $\binom{d_i}{2} = d_i(d_i - 1)/2$,
since there are $d_i$ neighbors for node $v_i$.

Example 3.12. Figure 3.10 shows how the local clustering coefficient
changes for node $v_1$. Thin lines depict $v_1$'s connections to its neighbors.
Dashed lines denote possible connections among neighbors, and solid lines
denote current connections among neighbors. Note that when none of the
neighbors are connected, the local clustering coefficient is zero, and when
all the neighbors are connected, it becomes maximum, $C(v_i) = 1$.
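The definition in Equation 3.52 can be sketched as a direct count over neighbor pairs. The function name and the small graph are hypothetical (the graph is not taken from Figure 3.10), and nodes with fewer than two neighbors are assigned 0 by assumption, since the ratio is undefined for them.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = list(adj[v])
    d = len(neighbors)
    if d < 2:
        return 0.0  # assumption: undefined case is reported as 0
    connected = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return connected / (d * (d - 1) / 2)

# Hypothetical graph: node 1 has neighbors 2, 3, 4; only 2 and 3 are linked.
toy_graph = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
c1 = local_clustering(toy_graph, 1)
c2 = local_clustering(toy_graph, 2)
print(c1, c2)
```

Here one of the three neighbor pairs of node 1 is connected, so its coefficient is 1/3, while node 2 has both of its neighbors connected and reaches the maximum of 1.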





Figure 3.10. Change in Local Clustering Coefficient for Different Graphs. Thin lines
depict connections to neighbors. Solid lines indicate connected neighbors, and dashed
lines are the missing connections among neighbors.
Figure 3.11. A Graph with Reciprocal Edges.


3.2.2 Reciprocity 


Reciprocity is a simplified version of transitivity, because it considers closed
loops of length 2, which can only happen in directed graphs. Formally, if
node $v$ is connected to node $u$, $u$ by connecting to $v$ exhibits reciprocity. On
microblogging site Tumblr, for example, these nodes are known as "mutual
followers." Informally, reciprocity is

If you become my friend, I'll be yours.

Figure 3.11 shows an example where two nodes ($v_1$ and $v_2$) in the graph
demonstrate reciprocal behavior.

Reciprocity counts the number of reciprocal pairs in the graph. Any
directed graph can have a maximum of $|E|/2$ pairs. This happens when all
edges are reciprocal. Thus, this value can be used as a normalization factor.
Reciprocity can be computed using the adjacency matrix $A$:

$$R = \frac{\sum_{i,j,\ i<j} A_{i,j} A_{j,i}}{|E|/2} = \frac{2}{|E|} \sum_{i,j,\ i<j} A_{i,j} A_{j,i} = \frac{2}{|E|} \times \frac{1}{2} \text{Tr}(A^2) = \frac{1}{|E|} \text{Tr}(A^2) = \frac{1}{m} \text{Tr}(A^2), \tag{3.53}$$

where $\text{Tr}(A) = A_{1,1} + A_{2,2} + \cdots + A_{n,n} = \sum_{i=1}^{n} A_{i,i}$ and $m$ is the number
of edges in the network. Note that the maximum value for $\sum_{i,j} A_{i,j} A_{j,i}$ is
$m$ when all directed edges are reciprocated.

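Example 3.13. For the graph shown in Figure 3.11, the adjacency matrix
is

$$A = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}. \tag{3.54}$$

Its reciprocity is

$$R = \frac{1}{m} \text{Tr}(A^2) = \frac{1}{4} \text{Tr}\left( \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix} \right) = \frac{2}{4} = \frac{1}{2}. \tag{3.55}$$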
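The trace formula for reciprocity is short enough to check numerically. A minimal sketch, assuming NumPy and a 0/1 adjacency matrix without self-loops; the function name `reciprocity` is illustrative.

```python
import numpy as np

def reciprocity(A):
    """Fraction of directed edges that are reciprocated: Tr(A^2) / m."""
    A = np.asarray(A, dtype=float)
    m = A.sum()                    # number of directed edges
    return np.trace(A @ A) / m

# Adjacency matrix of Example 3.13 (Equation 3.54).
A_recip = np.array([
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
])

R = reciprocity(A_recip)
print(R)
```

The diagonal entry $(A^2)_{i,i} = \sum_j A_{i,j} A_{j,i}$ counts the reciprocated edges leaving $v_i$, which is why the trace recovers the numerator of the reciprocity ratio. For this matrix, two of the four edges are reciprocated.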


3.3 Balance and Status 

A signed graph can represent the relationships of nodes in a social network,
such as friends or foes. For example, a positive edge from node $v_1$ to $v_2$
denotes that $v_1$ considers $v_2$ as a friend and a negative edge denotes that $v_1$
assumes $v_2$ is an enemy. Similarly, we can utilize signed graphs to represent
the social status of individuals. A positive edge connecting node $v_1$ to
$v_2$ can also denote that $v_1$ considers $v_2$'s status higher than its own in the
society. Both cases represent interactions that individuals exhibit about their
relationships. In real-world scenarios, we expect some level of consistency
with respect to these interactions. For instance, it is more plausible for a
friend of one's friend to be a friend than to be an enemy. In signed graphs,
this consistency translates to observing triads with three positive edges
(i.e., all friends) more frequently than ones with two positive edges and one
negative edge (i.e., a friend's friend is an enemy). Assume we observe a
signed graph that represents friends/foes or social status. Can we measure
the consistency of attitudes that individuals have toward one another?

To measure consistency in an individual’s attitude, one needs to utilize 
theories from social sciences to define what is a consistent attitude. In this 
section, we discuss two theories, social balance and social status, that can 
help determine consistency in observed signed networks. Social balance 
theory is used when edges represent friends/foes, and social status theory 
is employed when they represent status. 
Figure 3.12. Sample Graphs for Social Balance Theory. In balanced triangles, there are
an even number of negative edges. [Panels: four triangles labeled balanced and four
labeled unbalanced.]


Social Balance Theory

This theory, also known as structural balance theory, discusses consistency 
in friend/foe relationships among individuals. Informally, social balance 
theory says friend/foe relationships are consistent when 

The friend of my friend is my friend, 

The friend of my enemy is my enemy, 

The enemy of my enemy is my friend, 

The enemy of my friend is my enemy. 

We demonstrate a graph representation of social balance theory in
Figure 3.12. In this figure, positive edges demonstrate friendships and negative
ones demonstrate enemies. Triangles that are consistent based on this theory
are denoted as balanced and triangles that are inconsistent as unbalanced.
Let $w_{ij}$ denote the value of the edge between nodes $v_i$ and $v_j$. Then, for
a triangle of nodes $v_i$, $v_j$, and $v_k$, it is consistent based on social balance
theory; that is, it is balanced if and only if

$$w_{ij} w_{jk} w_{ki} \geq 0. \tag{3.56}$$

This is assuming that, for positive edges, $w_{ij} = 1$, and for negative edges,
$w_{ij} = -1$. We observe that, for all balanced triangles in Figure 3.12, the
value $w_{ij} w_{jk} w_{ki}$ is positive, and for all unbalanced triangles, it is negative.
Social balance can also be generalized to subgraphs other than triangles. In
general, for any cycle, if the product of edge values becomes positive, then
the cycle is socially balanced.
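The balance test reduces to the sign of a product. A minimal sketch, assuming edge values of +1 (friend) and -1 (enemy) listed in order around a triangle or longer cycle; the function name `is_balanced` is illustrative.

```python
from math import prod

def is_balanced(cycle_weights):
    """True if the product of the +1/-1 edge values around a cycle is positive."""
    # With +1/-1 weights the product is never zero, so "positive" is
    # the same as "an even number of negative edges".
    return prod(cycle_weights) > 0

all_friends = is_balanced([+1, +1, +1])       # friend of my friend is my friend
common_enemy = is_balanced([+1, -1, -1])      # enemy of my enemy is my friend
one_enemy = is_balanced([+1, +1, -1])         # friend of my friend is my enemy
all_enemies = is_balanced([-1, -1, -1])
four_cycle = is_balanced([-1, +1, -1, +1])    # longer cycle, two negative edges
print(all_friends, common_enemy, one_enemy, all_enemies, four_cycle)
```

The first two triangles and the four-node cycle contain an even number of negative edges and come out balanced, while the other two triangles do not.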
Figure 3.13. Sample Graphs for Social Status Theory. The left-hand graph is an
unbalanced configuration, and the right-hand graph is a balanced configuration.

Social Status Theory 

Social status theory measures how consistent individuals are in assigning 
status to their neighbors. It can be summarized as follows: 

If X has a higher status than Y and Y has a higher status than Z, then X should have a
higher status than Z.

We show this theory using two graphs in Figure 3.13. In this figure,
nodes represent individuals. Positive and negative signs show higher or
lower status depending on the arrow direction. A directed positive edge
from node X to node Y shows that Y has a higher status than X, and a
negative one shows the reverse. In the figure on the left, $v_2$ has a higher
status than $v_1$ and $v_3$ has a higher status than $v_2$, so based on status theory,
$v_3$ should have a higher status than $v_1$; however, we see that $v_1$ has a
higher status in our configuration. Based on social status theory, this is
implausible, and thus this configuration is unbalanced. The graph on the
right shows a balanced configuration with respect to social status theory.

In the example provided in Figure 3.13, social status is defined for the
most general example: a set of three connected nodes (a triad). However,
social status can be generalized to other graphs. For instance, in a cycle of
$n$ nodes, where $n - 1$ consecutive edges are positive and the last edge is
negative, social status theory considers the cycle balanced.

Note that the identical configuration can be considered balanced by
social balance theory and unbalanced based on social status theory (see
Exercises).


3.4 Similarity 

In this section we review measures used to compute similarity between
two nodes in a network. In social media, these nodes can represent
individuals in a friendship network or products that are related. The
similarity between these connected individuals can be computed either based on
the network in which they are embedded (i.e., network similarity) or based
on the similarity of the content they generate (i.e., content similarity). We
discuss content similarity in Chapter 5. In this section, we demonstrate
ways to compute similarity between two nodes using network information
regarding the nodes and edges connecting them. When using network
information, the similarity between two nodes can be computed by measuring
their structural equivalence or their regular equivalence.


3.4.1 Structural Equivalence

To compute structural equivalence, we look at the neighborhood shared by
two nodes; the size of this neighborhood defines how similar two nodes
are. For instance, two brothers have in common sisters, mother, father,
grandparents, and so on. This shows that they are similar, whereas two
random male or female individuals do not have much in common and are
not similar.

The similarity measures detailed in this section are based on the overlap
between the neighborhoods of the nodes. Let $N(v_i)$ and $N(v_j)$ be the
neighbors of nodes $v_i$ and $v_j$, respectively. In this case, a measure of node
similarity can be defined as follows:

$$\sigma(v_i, v_j) = |N(v_i) \cap N(v_j)|. \tag{3.57}$$

For large networks, this value can increase rapidly, because nodes may
share many neighbors. Generally, similarity is attributed to a value that is
bounded and is usually in the range [0, 1]. Various normalization procedures
can take place such as the Jaccard similarity or the cosine similarity:

$$\sigma_{\text{Jaccard}}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}, \tag{3.58}$$

$$\sigma_{\text{Cosine}}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{\sqrt{|N(v_i)|\,|N(v_j)|}}. \tag{3.59}$$

In general, the definition of neighborhood $N(v_i)$ excludes the node itself
($v_i$). This leads to problems with the aforementioned similarities because
nodes that are connected and do not share a neighbor will be assigned zero
similarity. This can be rectified by assuming nodes to be included in their
neighborhoods.
Figure 3.14. Sample Graph for Computing Similarity.


Example 3.14. Consider the graph in Figure 3.14. The similarity values
between nodes $v_2$ and $v_5$ are

$$\sigma_{\text{Jaccard}}(v_2, v_5) = \frac{|\{v_1, v_3, v_4\} \cap \{v_3, v_6\}|}{|\{v_1, v_3, v_4, v_6\}|} = 0.25, \tag{3.60}$$

$$\sigma_{\text{Cosine}}(v_2, v_5) = \frac{|\{v_1, v_3, v_4\} \cap \{v_3, v_6\}|}{\sqrt{|\{v_1, v_3, v_4\}|\,|\{v_3, v_6\}|}} = 0.40. \tag{3.61}$$
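The two normalized similarities of Example 3.14 can be sketched with plain set operations. The function names are illustrative, and the neighborhoods are the ones listed in the example.

```python
from math import sqrt

def jaccard(n_i, n_j):
    """Shared neighbors divided by the size of the union (Equation 3.58)."""
    return len(n_i & n_j) / len(n_i | n_j)

def cosine(n_i, n_j):
    """Shared neighbors divided by the geometric mean of the sizes (Eq. 3.59)."""
    return len(n_i & n_j) / sqrt(len(n_i) * len(n_j))

# Neighborhoods of v_2 and v_5 from Example 3.14.
N_v2 = {"v1", "v3", "v4"}
N_v5 = {"v3", "v6"}

jac = jaccard(N_v2, N_v5)
cos = cosine(N_v2, N_v5)
print(jac, cos)
```

The cosine value is $1/\sqrt{6} \approx 0.408$, which the example reports as 0.40.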


A more interesting way of measuring the similarity between $v_i$ and $v_j$
is to compare $\sigma(v_i, v_j)$ with the expected value of $\sigma(v_i, v_j)$ when nodes
pick their neighbors at random. The more distant these two values are, the
more significant the similarity observed between $v_i$ and $v_j$ ($\sigma(v_i, v_j)$) is.
For nodes $v_i$ and $v_j$ with degrees $d_i$ and $d_j$, this expectation is $\frac{d_i d_j}{n}$, where
$n$ is the number of nodes. This is because there is a $\frac{d_i}{n}$ chance of becoming
$v_i$'s neighbor and, since $v_j$ selects $d_j$ neighbors, the expected overlap is $\frac{d_i d_j}{n}$.
We can rewrite $\sigma(v_i, v_j)$ as

$$\sigma(v_i, v_j) = |N(v_i) \cap N(v_j)| = \sum_k A_{i,k} A_{j,k}. \tag{3.62}$$

Hence, a similarity measure can be defined by subtracting the random
expectation from Equation 3.62:

$$\begin{aligned}
\sigma_{\text{significance}}(v_i, v_j) &= \sum_k A_{i,k} A_{j,k} - \frac{d_i d_j}{n} \\
&= \sum_k A_{i,k} A_{j,k} - n \, \frac{1}{n} \sum_k A_{i,k} \, \frac{1}{n} \sum_k A_{j,k} \\
&= \sum_k A_{i,k} A_{j,k} - n \bar{A}_i \bar{A}_j \\
&= \sum_k (A_{i,k} A_{j,k} - \bar{A}_i \bar{A}_j) \\
&= \sum_k (A_{i,k} A_{j,k} - \bar{A}_i \bar{A}_j - \bar{A}_i \bar{A}_j + \bar{A}_i \bar{A}_j) \\
&= \sum_k (A_{i,k} A_{j,k} - A_{i,k} \bar{A}_j - \bar{A}_i A_{j,k} + \bar{A}_i \bar{A}_j) \\
&= \sum_k (A_{i,k} - \bar{A}_i)(A_{j,k} - \bar{A}_j),
\end{aligned} \tag{3.63}$$

where $\bar{A}_i = \frac{1}{n} \sum_k A_{i,k}$. The term $\sum_k (A_{i,k} - \bar{A}_i)(A_{j,k} - \bar{A}_j)$ is basically the
covariance between $A_i$ and $A_j$. The covariance can be normalized by the
multiplication of variances,

$$\sigma_{\text{pearson}}(v_i, v_j) = \frac{\sigma_{\text{significance}}(v_i, v_j)}{\sqrt{\sum_k (A_{i,k} - \bar{A}_i)^2} \sqrt{\sum_k (A_{j,k} - \bar{A}_j)^2}} = \frac{\sum_k (A_{i,k} - \bar{A}_i)(A_{j,k} - \bar{A}_j)}{\sqrt{\sum_k (A_{i,k} - \bar{A}_i)^2} \sqrt{\sum_k (A_{j,k} - \bar{A}_j)^2}}, \tag{3.64}$$

which is called the Pearson correlation coefficient. Its value, unlike the
other two measures, is in the range [-1, 1]. A positive correlation value
denotes that when $v_i$ befriends an individual $v_k$, $v_j$ is also likely to befriend
$v_k$. A negative value denotes the opposite (i.e., when $v_i$ befriends $v_k$, it is
unlikely for $v_j$ to befriend $v_k$). A zero value denotes that there is no linear
relationship between the befriending behavior of $v_i$ and $v_j$.
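The Pearson similarity is the ordinary correlation between two rows of the adjacency matrix, which gives a convenient cross-check. The sketch below assumes NumPy; `pearson_similarity` is an illustrative name, and the 4-node adjacency matrix is a hypothetical example, not one of the chapter's figures.

```python
import numpy as np

def pearson_similarity(A, i, j):
    """Pearson correlation between rows i and j of A (Equation 3.64)."""
    A = np.asarray(A, dtype=float)
    dev_i = A[i] - A[i].mean()
    dev_j = A[j] - A[j].mean()
    covariance_term = np.sum(dev_i * dev_j)   # sigma_significance (Eq. 3.63)
    return covariance_term / (np.sqrt(np.sum(dev_i ** 2)) *
                              np.sqrt(np.sum(dev_j ** 2)))

# Hypothetical undirected graph on four nodes.
A_toy = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
])

p_03 = pearson_similarity(A_toy, 0, 3)
p_02 = pearson_similarity(A_toy, 0, 2)
print(p_03, p_02)
```

The result can be compared with `np.corrcoef(A_toy[0], A_toy[3])[0, 1]`, which computes the same quantity. A row with constant entries (for example, an isolated node) has zero variance and would make the ratio undefined.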


3.4.2 Regular Equivalence 


In regular equivalence, unlike structural equivalence, we do not look at
the neighborhoods shared between individuals, but at how neighborhoods
themselves are similar. For instance, athletes are similar not because they
know each other in person, but because they know similar individuals, such
as coaches, trainers, and other players. The same argument holds for any
other profession or industry in which individuals might not know each
other in person, but are in contact with very similar individuals. Regular
equivalence assesses similarity by comparing the similarity of neighbors
and not by their overlap.

One way of formalizing this is to consider nodes $v_i$ and $v_j$ similar when
they have many similar neighbors $v_k$ and $v_l$. This concept is shown in
Figure 3.15(a). Formally,

$$\sigma_{\text{regular}}(v_i, v_j) = \alpha \sum_{k,l} A_{i,k} A_{j,l}\, \sigma_{\text{regular}}(v_k, v_l). \tag{3.65}$$

Figure 3.15. Regular Equivalence: (a) Original Formulation, (b) Relaxed Formulation.
Solid lines denote edges, and dashed lines denote
similarities between nodes. In regular equivalence, similarity between nodes $v_i$ and $v_j$
is replaced by similarity between (a) their neighbors $v_k$ and $v_l$ or between (b) neighbor
$v_k$ and node $v_j$.


Unfortunately, this formulation is self-referential because solving for $i$
and $j$ requires solving for $k$ and $l$, solving for $k$ and $l$ requires solving for
their neighbors, and so on. So, we relax this formulation and assume that
node $v_i$ is similar to node $v_j$ when $v_j$ is similar to $v_i$'s neighbors $v_k$. This is
shown in Figure 3.15(b). Formally,

$$\sigma_{\text{regular}}(v_i, v_j) = \alpha \sum_k A_{i,k}\, \sigma_{\text{regular}}(v_k, v_j). \tag{3.66}$$

In vector format, we have

$$\sigma_{\text{regular}} = \alpha A \sigma_{\text{regular}}. \tag{3.67}$$

A node is highly similar to itself. To make sure that our formulation
guarantees this, we can add an identity matrix to this vector format. Adding
an identity matrix will add 1 to all diagonal entries, which represent
self-similarities $\sigma_{\text{regular}}(v_i, v_i)$:

$$\sigma_{\text{regular}} = \alpha A \sigma_{\text{regular}} + I. \tag{3.68}$$

By rearranging, we get

$$\sigma_{\text{regular}} = (I - \alpha A)^{-1}, \tag{3.69}$$

which we can use to find the regular equivalence similarity.

Note the similarity between Equation 3.69 and that of Katz centrality
(Equation 3.21). As with Katz centrality, we must be careful how we choose
$\alpha$ for convergence. A common practice is to select an $\alpha$ such that $\alpha < 1/\lambda$,
where $\lambda$ is the largest eigenvalue of $A$.
Example 3.15. For the graph depicted in Figure 3.14, the adjacency matrix
is

$$A = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}. \tag{3.70}$$



The largest eigenvalue of $A$ is 2.43. We set $\alpha = 0.3 < 1/2.43$, and we
compute $(I - 0.3A)^{-1}$, which is the similarity matrix,

$$\sigma_{\text{regular}} = (I - 0.3A)^{-1} = \begin{bmatrix} 1.43 & 0.73 & 0.73 & 0.26 & 0.26 & 0.16 \\ 0.73 & 1.63 & 0.80 & 0.56 & 0.32 & 0.26 \\ 0.73 & 0.80 & 1.63 & 0.32 & 0.56 & 0.26 \\ 0.26 & 0.56 & 0.32 & 1.31 & 0.23 & 0.46 \\ 0.26 & 0.32 & 0.56 & 0.23 & 1.31 & 0.46 \\ 0.16 & 0.26 & 0.26 & 0.46 & 0.46 & 1.27 \end{bmatrix}. \tag{3.71}$$

Any row or column of this matrix shows the similarity to other nodes. We
can see that node $v_1$ is most similar (other than itself) to nodes $v_2$ and $v_3$.
Furthermore, nodes $v_2$ and $v_3$ have the highest similarity in this graph.
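The regular equivalence matrix can be sketched with a single matrix inversion. This is an illustration under stated assumptions: NumPy is assumed, the adjacency matrix is the one given for Figure 3.14 (with the row for $v_5$ taken from its neighborhood $\{v_3, v_6\}$ in Example 3.14), and $\alpha = 0.3$ is assumed because that value satisfies the defining relation $\sigma = \alpha A \sigma + I$ for the similarity entries printed in the example.

```python
import numpy as np

# Adjacency matrix for Figure 3.14 (undirected, six nodes).
A_fig314 = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

alpha = 0.3                                   # assumed value, see lead-in
lam = max(np.linalg.eigvalsh(A_fig314))       # largest eigenvalue (symmetric A)
assert alpha < 1.0 / lam, "alpha must stay below 1/lambda"

n = A_fig314.shape[0]
sigma = np.linalg.inv(np.eye(n) - alpha * A_fig314)   # Equation 3.69

# Most similar pair of distinct nodes: mask the diagonal, then take the max.
off_diag = sigma - np.diag(np.diag(sigma))
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(np.round(sigma, 2))
print("most similar pair:", i + 1, j + 1)
```

Since $A$ is symmetric, `eigvalsh` is used for the eigenvalue check. The most similar pair of distinct nodes should come out as $v_2$ and $v_3$.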


3.5 Summary 

In this chapter, we discussed measures for a social media network.
Centrality measures attempt to find the most central node within a graph. Degree
centrality assumes that the node with the maximum degree is the most
central individual. In directed graphs, prestige and gregariousness are variants
of degree centrality. Eigenvector centrality generalizes degree centrality and
considers individuals who know many important nodes as central. Based
on the Perron-Frobenius theorem, eigenvector centrality is determined by
computing the eigenvector of the adjacency matrix. Katz centrality solves
some of the problems with eigenvector centrality in directed graphs by
adding a bias term. PageRank centrality defines a normalized version of
Katz centrality. The Google search engine uses PageRank as a metric to rank
webpages. Betweenness centrality assumes that central nodes act as hubs
connecting other nodes, and closeness centrality implements the intuition
that central nodes are close to all other nodes. Node centrality measures
can be generalized to a group of nodes using group degree centrality, group
betweenness centrality, and group closeness centrality.

Linking between nodes (e.g., befriending in social media) is the most
commonly observed phenomenon in social media. Linking behavior is
analyzed in terms of its transitivity and its reciprocity. Transitivity is "when
a friend of my friend is my friend." The transitivity of linking behavior
is analyzed by means of the clustering coefficient. The global clustering
coefficient analyzes transitivity within a network, and the local clustering
coefficient performs that for a node. Transitivity is commonly considered
for closed triads of edges. For loops of length 2, the problem is simplified
and is called reciprocity. In other words, reciprocity is when "if you become
my friend, I'll be yours."

To analyze if relationships are consistent in social media, we used various
social theories to validate outcomes. Social balance and social status are
two such theories.

Finally, we analyzed node similarity measures. In structural equivalence,
two nodes are considered similar when they share neighborhoods. We
discussed cosine similarity and Jaccard similarity in structural equivalence.
In regular equivalence, nodes are similar when their neighborhoods are
similar.


3.6 Bibliographic Notes 

General reviews of different measures in graphs, networks, the web, and 
social media can be found in [Newman, 2010; Witten, Frank, and Hall, 
2011; Tan et al., 2005; Jiawei and Kamber, 2001; Wasserman and Faust, 
1994], 

A more detailed description of the PageRank algorithm can be found in 
[Page et al., 1999; Liu, 2007]. In practice, to compute the PageRank values, 
the power iteration method is used. Given a matrix $A$, this method produces 
an eigenvalue $\lambda$ and an eigenvector $\mathbf{v}$ of $A$. In the case of 
PageRank, the eigenvalue $\lambda$ is set to 1. The iterative algorithm starts 
with an initial eigenvector $\mathbf{v}_0$, and then $\mathbf{v}_{k+1}$ is 
computed from $\mathbf{v}_k$ as follows: 

$$\mathbf{v}_{k+1} = A \mathbf{v}_k. \tag{3.72}$$ 

The iterative process is continued until $\mathbf{v}_k \approx \mathbf{v}_{k+1}$ 
(i.e., convergence occurs). Other similar techniques to PageRank for computing 
influential nodes in a webgraph, such as the HITS [Kleinberg, 1998] algorithm, 
can be found in [Chakrabarti, 2003; Kosala and Blockeel, 2000]. Unlike PageRank, 
the HITS algorithm considers two types of nodes: authority nodes and hub 
nodes. An authority is a webpage that has many in-links. A hub is a page with 
many out-links. Authority pages have in-links from many hubs. In other 
words, hubs represent webpages that contain many useful links to authorities 
and authorities are influential nodes in the webgraph. HITS employs an 
iterative approach to compute authority and hub scores for all nodes in the 
graph. Nodes with high authority scores are classified as authorities and 
nodes with high hub scores as hubs. Webpages with high authority scores or 
hub scores can be recommended to users in a web search engine. 
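
The power iteration described in this note can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any search engine: the function name `power_iteration`, the tolerance, and the small column-stochastic example matrix are all assumptions made for the example. The vector is rescaled to sum to 1 after each step, which is harmless when the dominant eigenvalue is 1.

```python
def power_iteration(A, num_iters=1000, tol=1e-10):
    """Approximate the dominant eigenvector of a square, nonnegative matrix A.

    A is a list of rows. The returned vector is scaled so its entries sum to 1.
    """
    n = len(A)
    v = [1.0 / n] * n  # initial vector v_0 (uniform; an arbitrary choice)
    for _ in range(num_iters):
        # One step of Equation 3.72: v_{k+1} = A v_k
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [x / total for x in w]
        # Stop when v_k and v_{k+1} are (almost) the same vector
        if max(abs(w[i] - v[i]) for i in range(n)) < tol:
            return w
        v = w
    return v


# Hypothetical 3-node example; each column sums to 1.
A = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
scores = power_iteration(A)
```

For this example matrix the fixed point can be checked by hand: solving $\mathbf{v} = A\mathbf{v}$ with entries summing to 1 gives $(4/9, 2/9, 3/9)$.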

Betweenness algorithms can be improved using all-pair shortest paths 
algorithms [Warshall, 1962] or algorithms optimized for computing 
betweenness, such as the Brandes’ algorithm discussed in [Brandes, 2001; 
Tang and Liu, 2010]. 

A review of node similarity and normalization procedures is provided in 
Leicht, Holme, and Newman [2005]. Jaccard similarity was introduced in 
[Jaccard, 1901] and cosine similarity was introduced by Salton and McGill 
[1986]. 

REGE [White, 1980, 1984] and CATREGE [Stephen and Martin, 1993] 
are well-known algorithms for computing regular equivalence. 

3.7 Exercises 
Centrality 

1. Come up with an example of a directed connected graph in which 
eigenvector centrality becomes zero for some nodes. Describe when 
this happens. 

2. Does $\beta$ have any effect on the order of centralities? In other words, 
if for one value of $\beta$ the centrality value of node $v_i$ is greater than 
that of $v_j$, is it possible to change $\beta$ in a way such that $v_j$'s centrality 
becomes larger than that of $v_i$'s? 

3. In PageRank, what $\alpha$ values can we select to guarantee that centrality 
values are calculated correctly (i.e., values do not diverge)? 

4. Calculate PageRank values for this graph when 

• $\alpha = 1$, $\beta = 0$ 

• $\alpha = 0.85$, $\beta = 1$ 

• $\alpha = 0$, $\beta = 1$ 

Discuss the effects of different values of $\alpha$ and $\beta$ for this particular 
problem. 

5. Consider a full n-tree. This is a tree in which every node other than 
the leaves has n children. Calculate the betweenness centrality for 
the root node, internal nodes, and leaves. 

6. Show an example where the eigenvector centrality of all nodes in the 
graph is the same while betweenness centrality gives different values 
for different nodes. 

Transitivity and Reciprocity 

7. In a directed graph G(V, E), 

• Let $p$ be the probability that any node $v_i$ is connected to any node 
$v_j$. What is the expected reciprocity of this graph? 

• Let $m$ and $n$ be the number of edges and number of nodes, respectively. 
What is the maximum reciprocity? What is the minimum? 

8. Given all graphs $\{G(V, E) \text{ s.t. } |E| = m, |V| = n\}$, 

(a) When $m = 15$ and $n = 10$, find a graph with a minimum average 
clustering coefficient (one is enough). 

(b) Can you come up with an algorithm to find such a graph for any 
$m$ and $n$? 

Balance and Status 

9. Find all conflicting directed triad configurations for social balance 
and social status. A conflicting configuration is an assignment of 
positive/negative edge signs for which one theory considers the triad 
balanced and the other considers it unbalanced. 

Similarity 

10. In Figure 3.6, 

• Compute node similarity using Jaccard and cosine similarity for 
nodes $v_5$ and $v_4$. 

• Find the most similar node to $v_7$ using regular equivalence. 


4 

Network Models 


In May 2011, Facebook had 721 million users, represented by a graph of 721 
million nodes. A Facebook user at the time had an average of 190 friends; 
that is, all Facebook users, taken into account, had a total of 68.5 billion 
friendships (i.e., edges). What are the principal underlying processes that 
help initiate these friendships? More importantly, how can these seemingly 
independent friendships form this complex friendship network? 

In social media, many social networks contain millions of nodes and 
billions of edges. These complex networks have billions of friendships, the 
reasons for existence of most of which are obscure. Humbled by the complexity 
of these networks and the difficulty of independently analyzing each 
one of these friendships, we design models that generate, on a smaller scale, 
graphs similar to real-world networks. On the assumption that these models 
simulate properties observed in real-world networks well, the analysis of 
real-world networks boils down to a cost-efficient measuring of different 
properties of simulated networks. In addition, these models 

• allow for a better understanding of phenomena observed in real-world 
networks by providing concrete mathematical explanations and 

• allow for controlled experiments on synthetic networks when real-world 
networks are not available. 

We discuss three principal network models in this chapter: the random 
graph model, the small-world model, and the preferential attachment model. 
These models are designed to accurately model properties observed in real-world 
networks. Before we delve into the details of these models, we discuss 
their properties. 


4.1 Properties of Real-World Networks 

Real-world networks share common characteristics. When designing network 
models, we aim to devise models that can accurately describe these 
networks by mimicking these common characteristics. To determine these 
characteristics, a regular practice is to identify their attributes and show 
that measurements for these attributes are consistent across networks. In 
particular, three network attributes exhibit consistent measurements across 
real-world networks: degree distribution, clustering coefficient, and average 
path length. As we recall, degree distribution denotes how node degrees are 
distributed across a network. The clustering coefficient measures transitivity 
in a network. Finally, average path length denotes the average distance 
(shortest path length) between pairs of nodes. We discuss how these three 
attributes behave in real-world networks next. 


4.1.1 Degree Distribution 

Consider the distribution of wealth among individuals. Most individuals 
have an average amount of capital, whereas a few are considered wealthy. 
In fact, we observe exponentially more individuals with an average amount 
of capital than wealthier ones. Similarly, consider the population of cities. A 
few metropolitan areas are densely populated, whereas other cities have an 
average population size. In social media, we observe the same phenomenon 
regularly when measuring popularity or interestingness for entities. For 
instance, 

• Many sites are visited less than a thousand times a month, whereas a 
few are visited more than a million times daily. 

• Most social media users are active on a few sites, whereas some 
individuals are active on hundreds of sites. 

• There are exponentially more modestly priced products for sale compared 
to expensive ones. 

• There exist many individuals with a few friends and a handful of users 
with thousands of friends. 

The last observation is directly related to node degrees in social media. 
The degree of a node in social media often denotes the number of friends 
an individual has. Thus, the distribution of the number of friends denotes 
the degree distribution of the network. It turns out that in all provided 
observations, the distribution of values follows a power-law distribution. 
For instance, let $k$ denote the degree of a node (i.e., the number of friends 
an individual has). Let $p_k$ denote the fraction of individuals with degree $k$ 
(i.e., the frequency of observing $k$); then, in the power-law distribution, 

$$p_k = a k^{-b}, \tag{4.1}$$ 

where $b$ is the power-law exponent and $a$ is the power-law intercept. A 
power-law degree distribution is shown in Figure 4.1(a). 

Figure 4.1. Power-Law Degree Distribution and Its Log-Log Plot. (a) Power-Law 
Degree Distribution; (b) Log-Log Plot of Power-Law Degree Distribution. 

Taking the logarithm of both sides of Equation 4.1, we get 

$$\ln p_k = -b \ln k + \ln a. \tag{4.2}$$ 

Equation 4.2 shows that the log-log plot of a power-law distribution is a 
straight line with slope $-b$ and intercept $\ln a$ (see Figure 4.1(b)). This also 
reveals a methodology for checking whether a network exhibits a power-law 
distribution. We can do the following: 

• Pick a popularity measure and compute it for the whole network. For 
instance, we can take the number of friends in a social network as a 
measure. We denote the measured value as k. 

• Compute $p_k$, the fraction of individuals having popularity $k$. 

• Plot a log-log graph, where the x-axis represents $\ln k$ and the y-axis 
represents $\ln p_k$. 

• If a power-law distribution exists, we should observe a straight line in 
the plot. 
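
The checklist above can be approximated numerically by fitting a straight line to the points $(\ln k, \ln p_k)$ with ordinary least squares. The sketch below is only illustrative: the function name `fit_power_law` and the synthetic degree list are assumptions, nodes of degree 0 are dropped because $\ln 0$ is undefined, and a least-squares fit on a log-log plot is a rough diagnostic rather than a rigorous statistical test.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Fit ln p_k = -b ln k + ln a by least squares; return (b, a).

    p_k is the fraction of positive-degree nodes that have degree k.
    Requires at least two distinct positive degree values.
    """
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(counts[k] / total) for k in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic degrees whose counts are exactly proportional to k**-2.
degrees = []
for k in range(1, 7):
    degrees.extend([k] * (3600 // (k * k)))

b, a = fit_power_law(degrees)
```

Because the synthetic counts lie exactly on a power law with exponent 2, the fitted exponent `b` comes out as 2 up to floating-point error; real degree data would scatter around the line, especially in the tail.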

Figure 4.2 depicts some log-log graphs for the number of friends on 
real-world networks. In all networks, a linear trend is observed denoting a 
power-law degree distribution. 

Networks exhibiting power-law degree distribution are often called scale-free 
networks. Since the majority of social networks are scale-free, we are 
interested in models that can generate synthetic networks with a power-law 
degree distribution. 
Figure 4.2. Log-Log Plots for Power-Law Degree Distribution in Social Media 
Networks (panel (a): Blog Catalog). In these figures, the x-axis represents the 
logarithm of the degree, and the y-axis represents the logarithm of the fraction 
of individuals with that degree (i.e., $\log(p_k)$). The line demonstrates the 
linear trend observed in log-log plots of power-law distributions. 


4.1.2 Clustering Coefficient 

In real-world social networks, friendships are highly transitive. In other 
words, friends of an individual are often friends with one another. These 
friendships form triads of friendships that are frequently observed in social 
networks. These triads result in networks with high average [local] clustering 
coefficients. In May 2011, Facebook had an average clustering coefficient 
of 0.5 for individuals who had two friends (i.e., degree 2) [Ugander 
et al., 2011]. This indicates that for 50% of all users with two friends, their 
two friends were also friends with each other. Table 4.1 provides the average 
clustering coefficient for several real-world social networks and the web. 
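
As a concrete illustration of the quantity reported here, the following sketch computes the local clustering coefficient of a node and the average over all nodes. The adjacency representation (a dictionary mapping each node to a set of neighbors), the function names, and the toy graph are assumptions for the example; nodes with fewer than two neighbors are assigned a coefficient of 0, which is one common convention.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    d = len(neighbors)
    if d < 2:
        return 0.0  # convention chosen for this sketch
    connected_pairs = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return connected_pairs / (d * (d - 1) / 2)


def average_clustering(adj):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Toy undirected graph: a triangle (1, 2, 3) with a pendant node 4 attached to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
avg_c = average_clustering(toy)
```

In the toy graph, nodes 1 and 2 have coefficient 1, node 3 has 1/3 (one connected pair out of three), and node 4 has 0, so the average is 7/12.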


Table 4.1. Average Local Clustering Coefficient in Real-World Networks 
(from [Broder et al., 2000; Ugander et al., 2011; Mislove et al., 2007]) 

Web      Facebook                  Flickr   LiveJournal   Orkut   YouTube 
0.081    0.14 (with 100 friends)   0.31     0.33          0.17    0.13 
Table 4.2. Average Path Length in Real-World Networks (from [Broder et al., 
2000; Ugander et al., 2011; Mislove et al., 2007]) 

Web      Facebook   Flickr   LiveJournal   Orkut   YouTube 
16.12    4.7        5.67     5.88          4.25    5.10 


4.1.3 Average Path Length 


In real-world networks, any two members of the network are usually connected 
via short paths. In other words, the average path length is small. This 
is known as the small-world phenomenon. In the well-known small-world 
experiment conducted in the 1960s by Stanley Milgram, Milgram conjectured 
that people around the world are connected to one another via a path 
of at most six individuals (i.e., the six degrees of separation). Similarly, 
we observe small average path lengths in social networks. For example, in 
May 2011, the average path length between individuals in the Facebook 
graph was 4.7. This average was 4.3 for individuals in the United States at 
the same time [Ugander et al., 2011]. Table 4.2 provides the average path 
length for real-world social networks and the web. 
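
For small graphs, the average path length can be computed directly by running a breadth-first search from every node. The sketch below assumes an unweighted, undirected graph stored as a dictionary of neighbor sets and averages only over pairs that are actually connected; the function names are illustrative. Networks of the size discussed here are typically handled with sampling or approximation instead of this all-pairs approach.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of connected nodes."""
    total = 0
    pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy path graph 1 - 2 - 3.
path = {1: {2}, 2: {1, 3}, 3: {2}}
apl = average_path_length(path)
```

On the three-node path, the distances are 1, 1, and 2, so the average is 4/3.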

These three properties (power-law degree distribution, high clustering 
coefficient, and small average path length) are consistently observed in 
real-world networks. We design models based on simple assumptions on 
how friendships are formed, hoping that these models generate scale-free 
networks with high clustering coefficients and small average path lengths. 
We start with the simplest network model, the random graph model. 


4.2 Random Graphs 

We start with the most basic assumption on how friendships can be formed: 

Edges (i.e., friendships) between nodes (i.e., individuals) are formed randomly. 

The random graph model follows this basic assumption. In reality, friendships 
in real-world networks are far from random. By assuming random 
friendships, we simplify the process of friendship formation in real-world 
networks, hoping that these random friendships ultimately create networks 
that exhibit common characteristics observed in real-world networks. 

Formally, we can assume that for a graph with a fixed number of nodes 
$n$, any of the $\binom{n}{2}$ edges can be formed independently, with probability $p$. 
This graph is called a random graph and we denote it as the $G(n, p)$ model. 
This model was first proposed independently by Edgar Gilbert [1959] and 
Solomonoff and Rapoport [1951]. Another way of randomly generating 
graphs is to assume that both the number of nodes $n$ and the number of 
edges $m$ are fixed. However, we need to determine which $m$ edges are 
selected from the set of $\binom{n}{2}$ possible edges. Let $\Omega$ denote the set of graphs 
with $n$ nodes and $m$ edges. To generate a random graph, we can uniformly 
select one of the graphs in $\Omega$. The number of graphs with $n$ nodes and $m$ 
edges (i.e., $|\Omega|$) is 

$$|\Omega| = \binom{\binom{n}{2}}{m}. \tag{4.3}$$ 

The uniform random graph selection probability is $\frac{1}{|\Omega|}$. One can think 
of the probability of uniformly selecting a graph as an analog to $p$, the 
probability of selecting an edge in $G(n, p)$. 

The second model was introduced by Paul Erdős and Alfréd Rényi [1959] 
and is denoted as the $G(n, m)$ model. In the limit, both models act similarly. 
The expected number of edges in $G(n, p)$ is $\binom{n}{2} p$. Now, if we set $\binom{n}{2} p = m$ 
in the limit, both models act the same because they contain the same number 
of edges. Note that the $G(n, m)$ model contains a fixed number of edges; 
however, a graph generated by $G(n, p)$ can contain anywhere from none to 
all of the possible edges. 
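
The two generation procedures can be sketched as follows. This is a minimal illustration under stated assumptions: nodes are labeled $0, \ldots, n-1$, a graph is returned as a list of edges, and the function names `gnp` and `gnm` are chosen for the example. Enumerating all $\binom{n}{2}$ candidate edges is simple but only practical for small $n$.

```python
import random
from itertools import combinations


def gnp(n, p, rng=random):
    """G(n, p): keep each of the C(n, 2) possible edges independently with probability p."""
    return [edge for edge in combinations(range(n), 2) if rng.random() < p]


def gnm(n, m, rng=random):
    """G(n, m): choose m distinct edges uniformly from the C(n, 2) possible edges."""
    return rng.sample(list(combinations(range(n), 2)), m)


# Example: 10 nodes, about 15 edges on average versus exactly 15 edges.
rng = random.Random(42)
g1 = gnp(10, 15 / 45, rng)
g2 = gnm(10, 15, rng)
```

Here `g2` always has exactly 15 edges, whereas the size of `g1` varies from run to run around its expected value $\binom{10}{2} p = 15$, which mirrors the distinction drawn in the text.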

Mathematically, the $G(n, p)$ model is almost always simpler to analyze; 
hence the rest of this section deals with properties of this model. Note 
that there exist many graphs with $n$ nodes and $m$ edges (i.e., generated by 
$G(n, m)$). The same argument holds for $G(n, p)$, and many graphs can be 
generated by the model. Therefore, when measuring properties in random 
graphs, the measures are calculated over all graphs that can be generated 
by the model and then averaged. This is particularly useful when we are 
interested in the average, and not specific, behavior of large graphs. 

In $G(n, p)$, the number of edges is not fixed; therefore, we first examine 
some mathematical properties regarding the expected number of edges 
that are connected to a node, the expected number of edges observed in the 
graph, and the likelihood of observing m edges in a random graph generated 
by the G(n, p) process. 

Proposition 4.1. The expected number of edges connected to a node 
(expected degree) in $G(n, p)$ is $(n - 1)p$. 

Proof. A node can be connected to at most $n - 1$ nodes (via $n - 1$ edges). 
All edges are selected independently with probability $p$. Therefore, on 
average $(n - 1)p$ of them are selected. The expected degree is often denoted 
using notation $c$ or $k$ in the literature. Since we frequently use $k$ to denote 
degree values, we use $c$ to denote the expected degree of a random graph: 

$$c = (n - 1)p, \tag{4.4}$$ 

or equivalently, 

$$p = \frac{c}{n - 1}. \tag{4.5}$$ 

□ 


Proposition 4.2. The expected number of edges in $G(n, p)$ is $\binom{n}{2} p$. 

Proof. Following the same line of argument, because edges are selected 
independently and we have a maximum of $\binom{n}{2}$ edges, the expected number 
of edges is $\binom{n}{2} p$. □ 

Proposition 4.3. In a graph generated by the $G(n, p)$ model, the probability 
of observing $m$ edges is 

$$P(|E| = m) = \binom{\binom{n}{2}}{m} p^m (1 - p)^{\binom{n}{2} - m}, \tag{4.6}$$ 

which is a binomial distribution. 

Proof. $m$ edges are selected from the $\binom{n}{2}$ possible edges. These edges are 
formed with probability $p^m$, and other edges are not formed (to guarantee 
the existence of only $m$ edges) with probability $(1 - p)^{\binom{n}{2} - m}$. □ 
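
Propositions 4.2 and 4.3 can be checked numerically for a small graph. The sketch below evaluates Equation 4.6 directly with `math.comb` and confirms that the probabilities sum to 1 and that the mean number of edges equals $\binom{n}{2} p$. The function name and the parameter values ($n = 5$, $p = 0.3$) are arbitrary choices for illustration.

```python
from math import comb


def edge_count_pmf(n, p, m):
    """P(|E| = m) in G(n, p), following Equation 4.6."""
    possible = comb(n, 2)  # number of possible edges
    return comb(possible, m) * p**m * (1 - p) ** (possible - m)


n, p = 5, 0.3
possible_edges = comb(n, 2)
pmf = [edge_count_pmf(n, p, m) for m in range(possible_edges + 1)]
total_probability = sum(pmf)
expected_edges = sum(m * pmf[m] for m in range(possible_edges + 1))
```

With these values there are 10 possible edges, so the expected number of edges is $10 \times 0.3 = 3$.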

Given these basic propositions, we next analyze how random graphs 
evolve as we add edges to them. 


4.2.1 Evolution of Random Graphs 

In random graphs, when nodes form connections, after some time a large 
fraction of nodes get connected (i.e., there is a path between any pair of 
them). This large fraction forms a connected component, commonly called 
the largest connected component or the giant component. We can tune the 
behavior of the random graph model by selecting the appropriate $p$ value. 

In $G(n, p)$, when $p = 0$, the size of the largest connected component is 0 
(no two nodes are connected), and when $p = 1$, the size is $n$ (all pairs are 
connected). Table 4.3 provides the size of the largest connected component 
(slc in the table) for random graphs with 10 nodes and different $p$ values. 
Table 4.3. Evolution of Random Graphs. Here, p is the random graph 
generation probability, c is the average degree, ds is the diameter size, slc is 
the size of the largest component, and l is the average path length. The 
highlighted column (marked *) denotes phase transition in the random graph 

p      0.0    0.055   0.11*   1.0 
c      0.0    0.8     ≈1      9.0 
ds     0      2       6       1 
slc    0      4       7       10 
l      0.0    1.5     2.66    1.0 

The table also provides information on the average degree c, the diameter 
size ds, the size of the largest component slc, and the average path length 
l of the random graph. 

As shown in Table 4.3, as p gets larger, the graph gets denser. When p 
is very small, the following is found: 

1. No giant component is observed in the graph. 

2. Small isolated connected components are formed. 

3. The diameter is small because all nodes are in isolated components, 
in which they are connected to a handful of other nodes. 

As p gets larger, the following occurs: 

1. A giant component starts to appear. 

2. Isolated components become connected. 

3. The diameter values increase. 

At this point, nodes are connected to each other via long paths (see 
p = 0.11 in Table 4.3). As p continues to get larger, the random graph 
properties change again. For larger values, the diameter starts shrinking as 
nodes get connected to each other via different paths (that are likely to be 
shorter). The point where diameter value starts to shrink in a random graph 
is called phase transition. At the point of phase transition, the following 
two phenomena are observed: 

1. The giant component, which just started to appear, starts to grow. 

2. The diameter, which just reached its maximum value, starts decreasing. 
Figure 4.3. Nodes Visited by Moving n-hops away in a Random Graph. c denotes the 
expected node degree. 

It is proven that in random graphs phase transition occurs when c = 1; 
that is, p = 1 /(n — 1). 
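
The emergence of the giant component around $c = 1$ can be explored with a small simulation. The sketch below is an illustration only: it generates $G(n, p)$ edges on the fly, merges components with a union-find structure, and reports the size of the largest component. The function name, graph size, and number of trials are assumptions. Note that this sketch counts nodes, so an isolated node forms a component of size 1, whereas the text reports a size of 0 when no nodes are connected.

```python
import random


def largest_component_size(n, p, rng):
    """Size (in nodes) of the largest connected component of one G(n, p) sample."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # edge (i, j) is formed
                ri, rj = find(i), find(j)
                if ri != rj:
                    if size[ri] < size[rj]:
                        ri, rj = rj, ri
                    parent[rj] = ri
                    size[ri] += size[rj]
    return max(size[find(x)] for x in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    n = 300
    for c in (0.5, 1.0, 2.0):
        p = c / (n - 1)
        trials = [largest_component_size(n, p, rng) for _ in range(5)]
        print(f"c = {c}: largest component covers {sum(trials) / (5 * n):.2f} of nodes")
```

Running the loop for expected degrees below, at, and above 1 typically shows the largest component covering only a small fraction of nodes for $c < 1$ and a substantial fraction for $c > 1$, although the exact numbers depend on the random seed and on $n$.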

Proposition 4.4. In random graphs, phase transition happens at c = 1. 


Proof. (Sketch) Consider a random graph with expected node degree $c$, 
where $c = p(n - 1)$. In this graph, consider any connected set of nodes 
$S$ and consider the complement set $\bar{S} = V - S$. For the sake of our proof, 
we assume that $|S| \ll |\bar{S}|$. Given any node $v$ in $S$, if we move one hop 
(edge) away from $v$, we visit approximately $c$ nodes. Following the same 
argument, if we move one hop away from nodes in $S$, we visit approximately 
$|S|c$ nodes. Assuming $|S|$ is small, the nodes in $S$ only visit nodes in $\bar{S}$, 
and when moving one hop away from $S$, the set of nodes “guaranteed to 
be connected” gets larger by a factor $c$ (see Figure 4.3). The connected 
set of visited nodes gets $c^2$ times larger when moving two hops and so on. 
Now, in the limit, if we want this component of visited nodes to become 
the largest connected component, then after traveling $n$ hops, we must 
have 

$$c^n \ge 1 \quad \text{or equivalently} \quad c \ge 1. \tag{4.7}$$ 

Otherwise (i.e., $c < 1$), the number of visited nodes dies out exponentially. 
Hence, phase transition happens at $c = 1$. □ 
Note that this proof sketch provides an intuitive approach to understand 
the proposition. Interested readers can refer to the bibliographic notes for a 
concrete proof. 

So far we have discussed the generation and evolution of random graphs; 
however, we also need to analyze how random graphs perform in terms of 
mimicking properties exhibited by real-world networks. It turns out that 
random graphs can model average path length in a real-world network 
accurately, but fail to generate a realistic degree distribution or clustering 
coefficient. We discuss these properties next. 


4.2.2 Properties of Random Graphs 

Degree Distribution 

When computing degree distribution, we estimate the probability 
$P(d_v = d)$ of observing degree $d$ for node $v$. 

Proposition 4.5. For a graph generated by $G(n, p)$, node $v$ has degree $d$, 
$d \le n - 1$, with probability 

$$P(d_v = d) = \binom{n-1}{d} p^d (1 - p)^{n-1-d}, \tag{4.8}$$ 

which is again a binomial degree distribution. 

Proof. The proof is left to the reader. □ 

This assumes that $n$ is fixed. We can generalize this result by computing 
the degree distribution of random graphs in the limit (i.e., $n \to \infty$). In this 
case, using Equation 4.4 and the fact that $\lim_{x \to 0} \ln(1 + x) = x$, we can 
compute the limit for each term of Equation 4.8: 

$$\lim_{n \to \infty} (1 - p)^{n-1-d} = \lim_{n \to \infty} e^{\ln (1 - p)^{n-1-d}} = \lim_{n \to \infty} e^{(n-1-d) \ln(1 - p)} = \lim_{n \to \infty} e^{(n-1-d) \ln\left(1 - \frac{c}{n-1}\right)} = \lim_{n \to \infty} e^{-(n-1-d) \frac{c}{n-1}} = e^{-c}. \tag{4.9}$$ 
We also have 

$$\lim_{n \to \infty} \binom{n-1}{d} = \lim_{n \to \infty} \frac{(n-1)!}{(n-1-d)! \, d!} = \lim_{n \to \infty} \frac{\big((n-1) \times (n-2) \times \cdots \times (n-d)\big)(n-1-d)!}{(n-1-d)! \, d!} = \lim_{n \to \infty} \frac{(n-1) \times (n-2) \times \cdots \times (n-d)}{d!} \approx \frac{(n-1)^d}{d!}. \tag{4.10}$$ 


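We can compute the degree distribution of random graphs in the limit 
by substituting Equations 4.10, 4.9, and 4.4 in Equation 4.8, 

$$\lim_{n \to \infty} P(d_v = d) = \lim_{n \to \infty} \binom{n-1}{d} p^d (1 - p)^{n-1-d} = \frac{(n-1)^d}{d!} \left(\frac{c}{n-1}\right)^d e^{-c} = e^{-c} \frac{c^d}{d!}, \tag{4.11}$$ 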


which is basically the Poisson distribution with mean c. Thus, in the limit, 
random graphs generate Poisson degree distribution, which differs from the 
power-law degree distribution observed in real-world networks. 
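
The convergence of the binomial degree distribution (Equation 4.8) to the Poisson distribution (Equation 4.11) can be seen numerically by holding $c$ fixed and taking $n$ large. In the sketch below, the function names and the values $n = 10001$ and $c = 3$ are arbitrary choices made for illustration.

```python
from math import comb, exp, factorial


def binomial_degree_pmf(n, p, d):
    """P(d_v = d) in G(n, p), Equation 4.8."""
    return comb(n - 1, d) * p**d * (1 - p) ** (n - 1 - d)


def poisson_degree_pmf(c, d):
    """Limiting degree distribution, Equation 4.11."""
    return exp(-c) * c**d / factorial(d)


n, c = 10001, 3.0
p = c / (n - 1)  # Equation 4.5
# Largest gap between the two distributions over small degrees.
max_gap = max(
    abs(binomial_degree_pmf(n, p, d) - poisson_degree_pmf(c, d)) for d in range(15)
)
```

For these values the two distributions agree closely at every degree checked, and the gap shrinks further as $n$ grows with $c$ held fixed.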


Clustering Coefficient 

Proposition 4.6. In a random graph generated by G(n, p), the expected 
local clustering coefficient for node v is p. 


Proof. The local clustering coefficient for node $v$ is 

$$C(v) = \frac{\text{number of connected pairs of } v\text{'s neighbors}}{\text{number of pairs of } v\text{'s neighbors}}. \tag{4.12}$$ 

However, $v$ can have different degrees depending on the edges that are 
formed randomly. Thus, we can compute the expected value for $C(v)$: 

$$E(C(v)) = \sum_{d=0}^{n-1} E(C(v) \mid d_v = d) \, P(d_v = d). \tag{4.13}$$ 

The first term is basically the local clustering coefficient of a node given 
its degree. For a random graph, we have 

$$E(C(v) \mid d_v = d) = \frac{\text{number of connected pairs of } v\text{'s } d \text{ neighbors}}{\text{number of pairs of } v\text{'s } d \text{ neighbors}} = \frac{p \binom{d}{2}}{\binom{d}{2}} = p. \tag{4.14}$$ 

Substituting Equation 4.14 in Equation 4.13, we get 

$$E(C(v)) = p \sum_{d=0}^{n-1} P(d_v = d) = p, \tag{4.15}$$ 

where we have used the fact that all probability distributions sum up 
to 1. □ 


Proposition 4.7. The global clustering coefficient of a random graph 
generated by $G(n, p)$ is $p$. 

Proof. The global clustering coefficient of a graph defines the probability of 
two neighbors of the same node being connected. In random graphs, for any 
two nodes, this probability is the same and is equal to the generation 
probability $p$ that determines the probability of two nodes getting connected. 
Note that in random graphs, the expected local clustering coefficient is 
equivalent to the global clustering coefficient. □ 

In random graphs, the clustering coefficient is equal to the probability 
$p$; therefore, by appropriately selecting $p$, we can generate networks with 
a high clustering coefficient. Note that selecting a large p is undesirable 
because doing so will generate a very dense graph, which is unrealistic, 
as in the real-world, networks are often sparse. Thus, random graphs are 
considered generally incapable of generating networks with high clustering 
coefficients without compromising other required properties. 


Average Path Length 


Proposition 4.8. The average path length $l$ in a random graph is 

$$l \approx \frac{\ln |V|}{\ln c}. \tag{4.16}$$ 

Proof. (Sketch) The proof is similar to the proof provided in determining 
when phase transition happens (see Section 4.2.1). Let $D$ denote the 
expected diameter size in the random graph. Starting with any node in a 
random graph and its expected degree $c$, one can visit approximately $c$ 
nodes by traveling one edge, $c^2$ nodes by traveling two edges, and $c^D$ nodes 
by traveling “diameter” number of edges. After this step, almost all nodes 
should be visited. In this case, we have 

$$c^D \approx |V|. \tag{4.17}$$ 

In random graphs, the expected diameter size tends to the average path 
length $l$ in the limit. This we provide without proof. Interested readers can 
refer to the bibliographic notes for pointers to concrete proofs. Using this 
fact, we have 

$$c^D \approx c^l \approx |V|. \tag{4.18}$$ 

Taking the logarithm of both sides, we get $l \approx \frac{\ln |V|}{\ln c}$. Therefore, the 
average path length in a random graph is equal to $\frac{\ln |V|}{\ln c}$. □ 

Table 4.4. A Comparison between Real-World Networks and Simulated Random 
Graphs. In this table, C denotes the average clustering coefficient. The last two 
columns show the average path length and the clustering coefficient for the random 
graph simulated for the real-world network. Note that average path lengths are 
modeled properly, whereas the clustering coefficient is underestimated 

                       Original Network                              Simulated Random Graph 
Network                Size        Average   Average Path   C        Average Path   C 
                                   Degree    Length                  Length 
Film Actors            225,226     61        3.65           0.79     2.99           0.00027 
Medline Coauthorship   1,520,251   18.1      4.6            0.56     4.91           1.8 × 10^-4 
E. Coli                282         7.35      2.9            0.32     3.04           0.026 
C. Elegans             282         14        2.65           0.28     2.25           0.05 


4.2.3 Modeling Real-World Networks with Random Graphs 

Given a real-world network, we can simulate it using a random graph model. 
We can compute the average degree c in the given network. From c, the 
connection probability $p$ can be computed ($p = \frac{c}{n-1}$). Using $p$ and the 
number of nodes in the given network $n$, a random graph model $G(n, p)$ 
can be simulated. Table 4.4 demonstrates the simulation results for various 
real-world networks. As observed in the table, random graphs perform 
well in modeling the average path lengths; however, when considering 
the transitivity, the random graph model drastically underestimates the 
clustering coefficient. 
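
The analytical quantities behind this comparison can be computed without running a simulation at all, using Equation 4.5 for $p$, Proposition 4.8 for the average path length, and Propositions 4.6 and 4.7 for the clustering coefficient. The function below is an illustrative sketch (its name and return format are assumptions), and the values reported in Table 4.4 come from simulated graphs, so they need not match these closed-form estimates exactly.

```python
from math import log


def random_graph_estimates(n, c):
    """Closed-form estimates for a G(n, p) model matched to a network.

    n: number of nodes, c: average degree (must exceed 1 for the path estimate).
    Returns the connection probability, the estimated average path length,
    and the expected clustering coefficient.
    """
    p = c / (n - 1)              # Equation 4.5
    avg_path = log(n) / log(c)   # Equation 4.16
    clustering = p               # Propositions 4.6 and 4.7
    return {"p": p, "avg_path_length": avg_path, "clustering": clustering}


# Size and average degree of the Film Actors network from Table 4.4.
film = random_graph_estimates(225226, 61)
```

For the Film Actors network, this gives an estimated average path length of roughly 3.0 and a clustering coefficient of roughly 0.00027, against an observed clustering coefficient of 0.79 in the original network.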

To tackle this issue, we study the small-world model. 
Figure 4.4. Regular Lattice of Degree 4. 

4.3 Small-World Model 

The assumption behind the random graph model is that connections in 
real-world networks are formed at random. Although unrealistic, random 
graphs can model average path lengths in real-world networks properly, but 
underestimate the clustering coefficient. To mitigate this problem, Duncan 
J. Watts and Steven Strogatz in 1997 proposed the small-world model. 

In real-world interactions, many individuals have a limited, and often 
fixed, number of connections. Individuals connect with their parents, 
brothers, sisters, grandparents, and teachers, among others. Thus, instead of 
assuming random connections, as we did in random graph models, one can 
assume an egalitarian model in real-world networks, where people have the 
same number of neighbors (friends). This again is unrealistic; however, it 
models more accurately the clustering coefficient of real-world networks. In 
graph theory terms, this assumption is equivalent to embedding individuals 
in a regular network. A regular (ring) lattice is a special case of regular 
networks where there exists a certain pattern for how ordered nodes are 
connected to one another. In particular, in a regular lattice of degree $c$, nodes 
are connected to their previous $c/2$ and following $c/2$ neighbors. Formally, 
for node set $V = \{v_1, v_2, v_3, \ldots, v_n\}$, an edge exists between node $v_i$ and 
$v_j$ if and only if 

$$0 < |i - j| \le c/2. \tag{4.19}$$ 


A regular lattice of degree 4 is shown in Figure 4.4. 
Algorithm 4.1 Small-World Generation Algorithm 

Require: Number of nodes $|V|$, mean degree $c$, parameter $\beta$ 
return A small-world graph $G(V, E)$ 
  $G$ = A regular ring lattice with $|V|$ nodes and degree $c$ 
  for node $v_i$ (starting from $v_1$), and all edges $e(v_i, v_j)$, $i < j$ do 
    $v_k$ = Select a node from $V$ uniformly at random. 
    if rewiring $e(v_i, v_j)$ to $e(v_i, v_k)$ does not create loops in the graph or 
        multiple edges between $v_i$ and $v_k$ then 
      rewire $e(v_i, v_j)$ with probability $\beta$: $E = E - \{e(v_i, v_j)\}$, 
          $E = E \cup \{e(v_i, v_k)\}$; 
    end if 
  end for 
  Return $G(V, E)$ 


The regular lattice can model transitivity well; however, the average path 
length is too high. Moreover, the clustering coefficient takes the value 

$$\frac{3(c - 2)}{4(c - 1)} \approx \frac{3}{4}, \tag{4.20}$$ 

which is fixed and not tunable to clustering coefficient values found in 
real-world networks. To overcome these problems, the proposed small-world 
model dynamically lies between the regular lattice and the random 
network. 
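
The lattice clustering value in Equation 4.20 can be verified on a small example. The sketch below builds a ring lattice in which each node is joined to its $c/2$ nearest neighbors on either side (wrapping around the ring) and compares a node's local clustering coefficient with $\frac{3(c-2)}{4(c-1)}$. The function names are illustrative, and the construction assumes that $c$ is even and that $n$ is comfortably larger than $c$ so that wrap-around does not create extra links.

```python
from itertools import combinations


def ring_lattice(n, c):
    """Ring lattice on nodes 0..n-1; each node links to c/2 neighbors per side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, c // 2 + 1):
            j = (i + step) % n  # wrap around the ring
            adj[i].add(j)
            adj[j].add(i)
    return adj


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    d = len(neighbors)
    if d < 2:
        return 0.0
    linked = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return linked / (d * (d - 1) / 2)


def lattice_clustering_formula(c):
    """Closed-form value from Equation 4.20."""
    return 3 * (c - 2) / (4 * (c - 1))


lattice4 = ring_lattice(20, 4)
lattice6 = ring_lattice(20, 6)
```

For degree 4 both the measured coefficient and the formula give 0.5, and for degree 6 both give 0.6; the value approaches 3/4 only as $c$ grows.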

In the small-world model, we assume a parameter $\beta$ that controls 
randomness in the model. The model starts with a regular lattice and starts 
adding random edges based on $\beta$. The parameter $0 \le \beta \le 1$ controls how 
random the model is. When $\beta$ is 0, the model is basically a regular lattice, 
and when $\beta = 1$, the model becomes a random graph. 

The procedure for generating small-world networks is outlined in 
Algorithm 4.1. The procedure creates new edges by a process called rewiring. 
Rewiring replaces an existing edge between nodes $v_i$ and $v_j$ with a nonexisting 
edge between $v_i$ and $v_k$ with probability $\beta$. In other words, an edge is 
disconnected from one of its endpoints $v_j$ and connected to a new endpoint 
$v_k$. Node $v_k$ is selected uniformly. 
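
The rewiring procedure can be sketched in Python as follows. This is one possible reading of Algorithm 4.1, not a reference implementation: the function name `small_world`, the adjacency-set representation, and the choice to visit lattice edges in ring order (instead of strictly by $i < j$) are assumptions made for the example. It assumes $c$ is even and $n > c$.

```python
import random


def small_world(n, c, beta, rng=random):
    """Generate a small-world graph by rewiring a ring lattice.

    Returns a dict mapping each node 0..n-1 to its set of neighbors.
    """
    adj = {i: set() for i in range(n)}
    lattice_edges = []
    # Start from a regular ring lattice of degree c.
    for i in range(n):
        for step in range(1, c // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
            lattice_edges.append((i, j))
    # Consider each original lattice edge once for rewiring.
    for i, j in lattice_edges:
        k = rng.randrange(n)  # candidate new endpoint, chosen uniformly
        if k == i or k in adj[i]:
            continue  # would create a self-loop or a duplicate edge
        if rng.random() < beta:
            adj[i].discard(j)
            adj[j].discard(i)
            adj[i].add(k)
            adj[k].add(i)
    return adj


regular = small_world(20, 4, 0.0, random.Random(1))
rewired = small_world(20, 4, 1.0, random.Random(1))
```

With $\beta = 0$ the lattice is returned unchanged. For any $\beta$ the number of edges stays at $nc/2$, because each rewiring removes one edge and adds one; only the degree of individual nodes changes.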

The network generated using this procedure has some interesting properties. 
Depending on the $\beta$ value, it can have a high clustering coefficient 
and also short average path lengths. The degree distribution, however, still 
does not match that of real-world networks. 
4.3.1 Properties of the Small-World Model 


Degree Distribution 

The degree distribution for the small-world model is as follows: 

$$P(d_v = d) = \sum_{n=0}^{\min(d - c/2, \, c/2)} \binom{c/2}{n} (1 - \beta)^n \beta^{c/2 - n} \frac{(\beta c/2)^{d - c/2 - n}}{(d - c/2 - n)!} e^{-\beta c/2}, \tag{4.21}$$ 

where $P(d_v = d)$ is the probability of observing degree $d$ for node $v$. We 
provide this equation without proof due to techniques beyond the scope of 
this book (see Bibliographic Notes). Note that the degree distribution is 
quite similar to the Poisson degree distribution observed in random graphs 
(Section 4.2.2). In practice, in the graph generated by the small-world 
model, most nodes have similar degrees due to the underlying lattice. In 
contrast, in real-world networks, degrees are distributed based on a power-law 
distribution, where most nodes have small degrees and a few have large 
degrees. 


Clustering Coefficient 

The clustering coefficient for a regular lattice is $\frac{3(c-2)}{4(c-1)}$,
and for a random graph model it is $p = \frac{c}{n-1}$. The clustering
coefficient for a small-world network is a value between these two,
depending on $\beta$. Commonly, the clustering coefficient for a regular
lattice is represented using $C(0)$, and the clustering coefficient for a
small-world model with $\beta = p$ is represented as $C(p)$. The relation
between the two values can be computed analytically; it has been proven that

$$C(p) \approx (1-p)^3 \, C(0). \tag{4.22}$$

The intuition behind this relation is that because the clustering coefficient
enumerates the number of closed triads in a graph, we are interested in triads
that are still left connected after the rewiring process. For a triad to stay
connected, none of its three edges may be rewired, and each edge is left
untouched with probability $(1 - p)$. Since the process is performed
independently for each edge, the probability of observing triads is
$(1 - p)^3$ times the probability of observing them in a regular lattice. Note
that we also need to take into account new triads that are formed by the
rewiring process; however, that probability is nominal and hence negligible.
The graph in Figure 4.5 depicts the value of $C(p)/C(0)$ for different values
of $p$.
Figure 4.5. Clustering Coefficient and Average Path Length Change in the Small-World
Model (from [Watts and Strogatz, 1997]). In this figure, $C(p)/C(0)$ denotes the
clustering coefficient of a small-world model, with $\beta = p$, over the regular
lattice. Similarly, $L(p)/L(0)$ denotes the average path length of a small-world
model over the regular lattice. Since models with a high clustering coefficient and
small average path length are desired, $\beta$ values in range
$0.01 \le \beta = p \le 0.1$ are preferred.

As shown in the figure, the value for $C(p)$ stays high until $p$ reaches
0.1 (10% rewired) and then decreases rapidly to a value around zero. Since
a high clustering coefficient is required in generated graphs,
$\beta \le 0.1$ is preferred.


Average Path Length 

The same procedure can be done for the average path length. The average
path length in a regular lattice is

$$\frac{n}{2c}. \tag{4.23}$$

We denote this value as $L(0)$. The average path length in a random graph
is $\frac{\ln |V|}{\ln c}$. We denote $L(p)$ as the average path length for a
small-world model where $\beta = p$. Unlike $C(p)$, no analytical formula for
comparing $L(p)$ to $L(0)$ exists; however, the relation can be computed
empirically for different values of $p$. Similar to $C(p)$, we plot
$L(p)/L(0)$ in Figure 4.5. As shown in the figure, the average path length
decays sooner than the clustering coefficient and becomes stable when around
1% of edges are rewired. Since we require small average path lengths in the
generated graphs, $\beta \ge 0.01$ is preferred.
Table 4.5. A Comparison between Real-World Networks and Simulated Graphs
Using the Small-World Model. In this table, C denotes the average clustering
coefficient. The last two columns show the average path length and the clustering
coefficient for the small-world graph simulated for the real-world network. Both
average path lengths and clustering coefficients are modeled properly.

| Network | Size | Average Degree | Original Network: Average Path Length | Original Network: C | Simulated Graph: Average Path Length | Simulated Graph: C |
|---|---|---|---|---|---|---|
| Film Actors | 225,226 | 61 | 3.65 | 0.79 | 4.2 | 0.73 |
| Medline Coauthorship | 1,520,251 | 18.1 | 4.6 | 0.56 | 5.1 | 0.52 |
| E. Coli | 282 | 7.35 | 2.9 | 0.32 | 4.46 | 0.31 |
| C. Elegans | 282 | 14 | 2.65 | 0.28 | 3.49 | 0.37 |


4.3.2 Modeling Real-World Networks with the Small-World Model 

A desirable model for a real-world network should generate graphs with
high clustering coefficients and short average path lengths. As shown in
Figure 4.5, for $0.01 \le \beta \le 0.10$, the small-world network generated
is acceptable, in which the average path length is small and the clustering
coefficient is still high. Given a real-world network in which average degree
$c$ and clustering coefficient $C$ are given, we set $C(p) = C$ and determine
$\beta$ using Equation 4.22. Given $\beta$, $c$, and $n$ (size of the
real-world network), we can simulate the small-world model.
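
Setting $C(p) = C$ in Equation 4.22 and solving for $p$ gives $\beta = 1 - (C / C(0))^{1/3}$, with $C(0)$ taken from Equation 4.20. The sketch below shows that arithmetic; the function names `lattice_clustering` and `estimate_beta` are hypothetical, and the approach only applies when the observed $C$ does not exceed $C(0)$.

```python
def lattice_clustering(c):
    """C(0): clustering coefficient of a regular lattice with degree c (Equation 4.20)."""
    return 3 * (c - 2) / (4 * (c - 1))


def estimate_beta(target_clustering, avg_degree):
    """Solve C(p) = (1 - p)^3 * C(0) for p, given an observed clustering coefficient."""
    c0 = lattice_clustering(avg_degree)
    ratio = target_clustering / c0
    if not 0 <= ratio <= 1:
        # Equation 4.22 can only shrink C(0), so larger targets are out of reach.
        raise ValueError("target clustering must lie between 0 and C(0)")
    return 1 - ratio ** (1 / 3)
```

For the Medline Coauthorship row of Table 4.5 (average degree 18.1, C = 0.56), `estimate_beta(0.56, 18.1)` gives a value of roughly 0.07, which falls inside the preferred range for $\beta$.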

Table 4.5 demonstrates the simulation results for various real-world
networks. As observed in the table, the small-world model generates a
realistic clustering coefficient and small average path length. Note that the
small-world model is still incapable of generating a realistic degree
distribution in the simulated graph. To generate scale-free networks (i.e.,
with a power-law degree distribution), we introduce the preferential
attachment model next.

4.4 Preferential Attachment Model 

There exist a variety of scale-free network-modeling algorithms. A
well-established one is the model proposed by Barabasi and Albert [1999].
The model is called preferential attachment or sometimes the Barabasi-Albert
(BA) model and is as follows:

When new nodes are added to networks, they are more likely to connect to existing nodes 
that many others have connected to. 

This connection likelihood is proportional to the degree of the node
that the new node is aiming to connect to. In other words, a
rich-get-richer phenomenon or aristocrat network is observed where the higher
the node’s degree, the higher the probability of new nodes getting connected
to it. Unlike random graphs in which we assume friendships are formed
randomly, in the preferential attachment model we assume that individuals
are more likely to befriend gregarious others. The model’s algorithm is
provided in Algorithm 4.2.

Algorithm 4.2 Preferential Attachment

Require: Graph $G(V_0, E_0)$, where $|V_0| = m_0$ and $d_v \ge 1 \;\forall v \in V_0$,
  number of expected connections $m \le m_0$, time to run the algorithm $t$
1: return A scale-free network
2: // Initial graph with $m_0$ nodes with degrees at least 1
3: $G(V, E) = G(V_0, E_0)$;
4: for 1 to $t$ do
5:   $V = V \cup \{v_i\}$; // add new node $v_i$
6:   while $d_i \ne m$ do
7:     Connect $v_i$ to a random node $v_j \in V$, $i \ne j$ (i.e., $E = E \cup \{e(v_i, v_j)\}$)
       with probability $P(v_j) = \frac{d_j}{\sum_k d_k}$.
8:   end while
9: end for
10: Return $G(V, E)$

The algorithm starts with a graph containing a small set of $m_0$ nodes
and then adds new nodes one at a time. Each new node gets to connect to
$m \le m_0$ other nodes, and each connection to existing node $v_i$ depends on
the degree of $v_i$ (i.e., $P(v_i) = \frac{d_i}{\sum_j d_j}$). Intrinsically,
higher degree nodes get more attention from newly added nodes. Note that the
initial $m_0$ nodes must have at least degree 1 for probability
$P(v_i) = \frac{d_i}{\sum_j d_j}$ to be nonzero.

The model incorporates two ingredients - (1) the growth element and
(2) the preferential attachment element - to achieve a scale-free network.
The growth is realized by adding nodes as time goes by. The preferential
attachment is realized by connecting to node $v_i$ based on its degree
probability, $P(v_i) = \frac{d_i}{\sum_j d_j}$. Removing any one of these
ingredients generates networks that are not scale-free (see Exercises). Next,
we show that preferential attachment models are capable of generating
networks with a power-law degree distribution. They are also capable of
generating small average path length, but unfortunately fail at generating
the high clustering coefficients observed in real-world networks.
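
A compact way to simulate Algorithm 4.2 is to keep a list in which every node appears once per unit of degree; drawing uniformly from that list selects a node with probability $d_j / \sum_k d_k$. The sketch below follows that idea under stated assumptions: the function name `preferential_attachment` is hypothetical, the initial graph is taken to be a ring of $m_0 \ge 2$ nodes (any initial graph with all degrees at least 1 would do), and repeated targets are redrawn so that each new node gets $m$ distinct neighbors, which deviates slightly from strictly independent draws.

```python
import random


def preferential_attachment(m0, m, t, seed=None):
    """Grow a graph for t steps; each new node attaches to m existing nodes."""
    if m0 < 2 or not 1 <= m <= m0:
        raise ValueError("need m0 >= 2 and 1 <= m <= m0")
    rng = random.Random(seed)

    # Assumed initial graph: a ring over nodes 0..m0-1, so every degree is >= 1.
    edges = set()
    for i in range(m0):
        j = (i + 1) % m0
        edges.add((min(i, j), max(i, j)))

    # Each node appears in the pool once per incident edge (i.e., d_v times).
    degree_pool = [node for edge in edges for node in edge]

    for step in range(t):
        new_node = m0 + step
        targets = set()
        while len(targets) < m:                     # redraw until m distinct targets
            targets.add(rng.choice(degree_pool))    # P(v_j) proportional to d_j
        for target in targets:
            edges.add((target, new_node))
            degree_pool.extend((target, new_node))  # both endpoints gain one degree
    return edges
```

After `t` steps the graph has `m0 + t` nodes, and each step adds exactly `m` edges.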
4.4.1 Properties of the Preferential Attachment Model


Degree Distribution 


We first demonstrate that the preferential attachment model generates
scale-free networks and can therefore model real-world networks. Empirical
evidence found by simulating the preferential attachment model
suggests that this model generates a scale-free network with exponent
$b = 2.9 \pm 0.1$ [Barabasi and Albert, 1999]. Theoretically, a mean-field
[Newman, Barabasi, and Watts, 2006] proof can be provided as follows.

Let $d_i$ denote the degree for node $v_i$. The probability of an edge
connecting from a new node to $v_i$ is

$$P(v_i) = \frac{d_i}{\sum_j d_j}. \tag{4.24}$$

The expected increase to the degree of $v_i$ is proportional to $d_i$ (this is
true on average). Assuming a mean-field setting, the expected temporal change
in $d_i$ is

$$\frac{d\,d_i}{dt} = m P(v_i) = \frac{m d_i}{\sum_j d_j} = \frac{m d_i}{2mt} = \frac{d_i}{2t}. \tag{4.25}$$


Note that at each time step, $m$ edges are added; therefore, $mt$ edges
are added over time, and the degree sum $\sum_j d_j$ is $2mt$. Rearranging and
solving this differential equation, we get

$$d_i(t) = m \left( \frac{t}{t_i} \right)^{0.5}. \tag{4.26}$$

Here, $t_i$ represents the time that $v_i$ was added to the network, and
because we set the expected degree to $m$ in preferential attachment, then
$d_i(t_i) = m$. The probability that $d_i(t)$ is less than $d$ is

$$P(d_i(t) < d) = P(t_i > m^2 t / d^2). \tag{4.27}$$


Assuming uniform intervals of adding nodes,

$$P(t_i > m^2 t / d^2) = 1 - P(t_i \le m^2 t / d^2) = 1 - \frac{m^2 t}{d^2} \, \frac{1}{t + m_0}. \tag{4.28}$$

The factor $\frac{1}{t + m_0}$ shows the probability that one time step has
passed because, at the end of the simulation, $t + m_0$ nodes are in the
network. The probability density for $P(d)$,

$$P(d) = \frac{\partial P(d_i(t) < d)}{\partial d}, \tag{4.29}$$

is what we are interested in, which, when solved, gives

$$P(d) = \frac{2 m^2 t}{d^3 (t + m_0)}$$

and the stationary solution ($t \to \infty$),

$$P(d) = \frac{2 m^2}{d^3}, \tag{4.30}$$

which is a power-law degree distribution with exponent b = 3. Note that in 
real-world networks, the exponent varies in a range (e.g., [2, 3]); however, 
there is no variance in the exponent of the introduced model. To overcome 
this issue, several other models are proposed. Interested readers can refer 
to the bibliographical notes for further references. 
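
The calculus steps that the mean-field argument passes over quickly can be written out explicitly. The following is a sketch under the same assumptions as the derivation above (the initial condition $d_i(t_i) = m$ and uniformly spaced arrival times); it introduces no new results.

```latex
\begin{align*}
\frac{d\,d_i}{dt} = \frac{d_i}{2t}
  &\;\Longrightarrow\; \int_{m}^{d_i(t)} \frac{dx}{x} = \int_{t_i}^{t} \frac{ds}{2s}
  \;\Longrightarrow\; \ln\frac{d_i(t)}{m} = \frac{1}{2}\ln\frac{t}{t_i}
  \;\Longrightarrow\; d_i(t) = m\left(\frac{t}{t_i}\right)^{1/2}, \\[4pt]
d_i(t) < d
  &\;\Longleftrightarrow\; m^2\,\frac{t}{t_i} < d^2
  \;\Longleftrightarrow\; t_i > \frac{m^2 t}{d^2}, \\[4pt]
\frac{\partial}{\partial d}\left[\,1 - \frac{m^2 t}{d^2\,(t + m_0)}\right]
  &= \frac{2 m^2 t}{d^3\,(t + m_0)}
  \;\xrightarrow{\;t \to \infty\;}\; \frac{2 m^2}{d^3}.
\end{align*}
```

The first line is a separation of variables, the second inverts the monotone relation between $d_i(t)$ and $t_i$, and the third differentiates the cumulative probability with respect to $d$, which is where the exponent 3 comes from.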


Clustering Coefficient 

In general, not many triangles are formed by the Barabasi-Albert model,
because edges are created independently and one at a time. Again, using a
mean-field analysis, the expected clustering coefficient can be calculated as

$$C = \frac{m_0 - 1}{8} \, \frac{(\ln t)^2}{t}, \tag{4.31}$$


where t is the time passed in the system during the simulation. We avoid 
the details of this calculation due to techniques beyond the scope of this 
book. Unfortunately, as time passes, the clustering coefficient gets smaller 
and fails to model the high clustering coefficient observed in real-world 
networks. 


Average Path Length 

The average path length of the preferential attachment model increases
logarithmically with the number of nodes present in the network:

$$l \sim \frac{\ln |V|}{\ln(\ln |V|)}. \tag{4.32}$$


This indicates that, on average, preferential attachment models generate 
shorter path lengths than random graphs. Random graphs are considered 
accurate in approximating the average path lengths. The same holds for 
preferential attachment models. 
Table 4.6. A Comparison between Real-World Networks and Simulated Graphs
Using Preferential Attachment. C denotes the average clustering coefficient. The
last two columns show the average path length and the clustering coefficient for
the preferential-attachment graph simulated for the real-world network. Note that
average path lengths are modeled properly, whereas the clustering coefficient is
underestimated.

| Network | Size | Average Degree | Original Network: Average Path Length | Original Network: C | Simulated Graph: Average Path Length | Simulated Graph: C |
|---|---|---|---|---|---|---|
| Film Actors | 225,226 | 61 | 3.65 | 0.79 | 4.90 | ≈0.005 |
| Medline Coauthorship | 1,520,251 | 18.1 | 4.6 | 0.56 | 5.36 | ≈0.0002 |
| E. Coli | 282 | 7.35 | 2.9 | 0.32 | 2.37 | 0.03 |
| C. Elegans | 282 | 14 | 2.65 | 0.28 | 1.99 | 0.05 |


4.4.2 Modeling Real-World Networks with the Preferential 
Attachment Model 

As with random graphs, we can simulate real-world networks by generating
a preferential attachment model by setting the expected degree $m$ (see
Algorithm 4.2). Table 4.6 demonstrates the simulation results for various
real-world networks. The preferential attachment model generates a realistic
degree distribution and, as observed in the table, small average path lengths;
however, the generated networks fail to exhibit the high clustering
coefficient observed in real-world networks.

4.5 Summary 

In this chapter, we discussed three well-established models that generate
networks with commonly observed characteristics of real-world networks:
random graphs, the small-world model, and preferential attachment. Random
graphs assume that connections are completely random. We discussed
two variants of random graphs: $G(n, p)$ and $G(n, m)$. Random graphs
exhibit a Poisson degree distribution, a small clustering coefficient $p$, and
a realistic average path length $\frac{\ln |V|}{\ln c}$.

The small-world model assumes that individuals have a fixed number
of connections in addition to random connections. This model generates
networks with high transitivity and short path lengths, both commonly
observed in real-world networks. Small-world models are created through
a process where a parameter $\beta$ controls how edges are randomly rewired
from an initial regular ring lattice. The clustering coefficient of the model
is approximately $(1 - p)^3$ times the clustering coefficient of a regular
lattice. No analytical solution to approximate the average path length with
respect to a regular ring lattice has been found. Empirically, when between
1% and 10% of edges are rewired ($0.01 \le \beta \le 0.1$), the model
resembles many real-world networks. Unfortunately, the small-world model
generates a degree distribution similar to the Poisson degree distribution
observed in random graphs.

Finally, the preferential attachment model assumes that friendship
formation likelihood depends on the number of friends individuals have. The
model generates a scale-free network; that is, a network with a power-law
degree distribution. When $k$ denotes the degree of a node, and $p_k$ the
fraction of nodes having degree $k$, then in a power-law degree distribution,

$$p_k = a k^{-b}. \tag{4.33}$$

Networks created using a preferential attachment model have a power-law
degree distribution with exponent $b = 2.9 \pm 0.1$. Using a mean-field
approach, we proved that this model has a power-law degree distribution.
The preferential attachment model also exhibits realistic average path
lengths that are smaller than the average path lengths in random graphs.
The basic caveat of the model is that it generates a small clustering
coefficient, which contradicts high clustering coefficients observed in
real-world networks.


4.6 Bibliographic Notes 

General reviews of the topics in this chapter can be found in [Newman,
Barabasi, and Watts, 2006; Newman, 2010; Barrat, Barthelemy, and
Vespignani, 2008; Jackson, 2010].

Initial random graph papers can be found in the works of Paul Erdos
and Alfred Renyi [1959, 1960, 1961] as well as Edgar Gilbert [1959] and
Solomonoff and Rapoport [1951]. As a general reference, readers can refer
to [Bollobas, 2001; Newman, Watts, and Strogatz, 2002; Newman, 2002b].
Random graphs described in this chapter did not have any specific degree
distribution; however, random graphs can be generated with a specific
degree distribution. For more on this refer to [Newman, 2010; Newman,
Strogatz, and Watts, 2000].

Small-worlds were first noticed in a short story by Hungarian writer F.
Karinthy in 1929. Works of Milgram in 1969 and Kochen and Pool in 1978
treated the subject more systematically. Milgram designed an experiment
in which he asked random participants in Omaha, Nebraska, or Wichita,
Kansas, to help send letters to a target person in Boston. Individuals were
only allowed to send the letter directly to the target person if they knew the
person on a first-name basis. Otherwise, they had to forward it to someone
who was more likely to know the target. The results showed that the letters
were on average forwarded 5.5 to 6 times until they reached the target in
Boston. Other recent research on small-world model dynamics can be found
in [Watts, 1999, 2002].

Price [1965, 1976] was among the first who described power laws
observed in citation networks and models capable of generating them.
Power-law distributions are commonly found in social networks and the
web [Faloutsos et al., 1999; Mislove et al., 2007]. The first developers of
preferential attachment models were Yule [1925], who described these models
for generating power-law distributions in plants, and Herbert A. Simon
[1955], who developed these models for describing power laws observed
in various phenomena: distribution of words in prose, scientists by
citations, and cities by population, among others. Simon used what is known as
the master equation to prove that preferential attachment models generate
power-law degree distributions. A more rigorous proof for estimating the
power-law exponent of the preferential attachment model using the master
equation method can be found in [Newman, 2010]. The preferential attachment
model introduced in this chapter has a fixed exponent $b = 3$, but,
as mentioned, real-world networks have exponents in the range [2, 3]. To
solve this issue, extensions have been proposed in [Krapivsky, Redner, and
Leyvraz, 2000; Albert and Barabasi, 2000].

4.7 Exercises 

Properties of Real-World Networks 

1. A scale invariant function $f(\cdot)$ is one such that, for a scalar $a$,

$$f(ax) = a^c f(x), \tag{4.34}$$

for some constant $c$. Prove that the power-law degree distribution is
scale invariant.

Random Graphs 

2. Assuming that we are interested in a sparse random graph, what should 
we choose as our p value? 

3. Construct a random graph as follows. Start with n nodes and a given
k. Generate all the possible combinations of k nodes. For each
combination, create a k-cycle with probability prrs, where a is a constant.

• Calculate the node mean degree and the clustering coefficient.
• What is the node mean degree if you create a complete graph instead
of the k-cycle?

4. When does phase transition happen in the evolution of random graphs? 
What happens in terms of changes in network properties at that time? 

Small-World Model 

5. Show that in a regular lattice the number of connections between
neighbors is given by $\frac{3}{8} c(c - 2)$, where $c$ is the average degree.

6. Show how the clustering coefficient can be computed in a regular lattice 
of degree k. 

7. Why are random graphs incapable of modeling real-world graphs? 
What are the differences between random graphs, regular lattices, and 
small-world models? 

8. Compute the average path length in a regular lattice. 

Preferential Attachment Model 

9. As a function of k, what fraction of pages on the web have k in-links,
assuming that a normal distribution governs the probability of webpages
choosing their links? What if we have a power-law distribution
instead?

10. In the Barabasi-Albert model (BA) two elements are considered: growth
and preferential attachment. The growth (G) is added to the model by
allowing new nodes to connect via m edges. The preferential attachment
(A) is added by weighting the probability of connection by the degree.
For the sake of brevity, we will consider the model as BA = A + G.
Now, consider models that only have one element: G, or A, and not
both. In the G model, the probability of connection is uniform
($P = \frac{1}{m_0 + t - 1}$), and in A, the number of nodes remains the same
throughout the simulation and no new node is added. In A, at each time step,
a node within the network is randomly selected based on degree probability
and then connected to another one within the network.

• Compute the degree distribution for these two models. 

• Determine if these two models generate scale-free networks. What 
does this prove? 
Mountains of raw data are generated daily by individuals on social media.
Around 6 billion photos are uploaded monthly to Facebook, the blogosphere 
doubles every five months, 72 hours of video are uploaded every minute 
to YouTube, and there are more than 400 million daily tweets on Twitter. 
With this unprecedented rate of content generation, individuals are easily 
overwhelmed with data and find it difficult to discover content that is relevant 
to their interests. To overcome the challenge, they need tools that can analyze 
these massive unprocessed sources of data (i.e., raw data) and extract useful 
patterns from them. Examples of useful patterns in social media are those 
that describe online purchasing habits or individuals’ website visit duration. 
Data mining provides the necessary tools for discovering patterns in data. 
This chapter outlines the general process for analyzing social media data 
and ways to use data mining algorithms in this process to extract actionable 
patterns from raw data. 

The process of extracting useful patterns from raw data is known as
knowledge discovery in databases (KDD). It is illustrated in Figure 5.1. The
KDD process takes raw data as input and provides statistically significant
patterns found in the data (i.e., knowledge) as output. From the raw data, a
subset is selected for processing and is denoted as target data. Target data
is preprocessed to make it ready for analysis using data mining algorithms.
Data mining is then performed on the preprocessed (and transformed)
data to extract interesting patterns. The patterns are evaluated to ensure
their validity and soundness and interpreted to provide insights into the
data.

In social media mining, the raw data is the content generated by
individuals, and the knowledge encompasses the interesting patterns observed
in this data. For example, for an online book seller, the raw data is the list
of books individuals buy, and an interesting pattern could describe books that
individuals often buy together.


Figure 5.1. Knowledge Discovery in Databases (KDD) process.


To analyze social media, we can either collect this raw data or use 
available repositories that host collected data from social media sites. 1 
When collecting data, we can either use APIs provided by social media sites 
for data collection or scrape the information from those sites. In either case, 
these sites are often networks of individuals where one can perform graph 
traversal algorithms to collect information from them. In other words, we 
can start collecting information from a subset of nodes on a social network, 
subsequently collect information from their neighbors, and so on. The data 
collected this way needs to be represented in a unified format for analysis. 
For instance, consider a set of tweets in which we are looking for common 
patterns. To find patterns in these tweets, they need to be first represented 
using a consistent data format. In the next section, we discuss data, its 
representation, and its types. 
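
The neighbor-by-neighbor collection strategy described above is a breadth-first traversal. A minimal sketch follows; the function name `crawl`, the `get_neighbors` callback, and the `max_nodes` cap are assumptions made for illustration, and in practice `get_neighbors` would wrap whatever API or scraper a given site permits.

```python
from collections import deque


def crawl(seeds, get_neighbors, max_nodes=1000):
    """Collect nodes breadth-first, starting from a set of seed nodes."""
    queue = deque(dict.fromkeys(seeds))   # de-duplicated seeds, original order kept
    seen = set(queue)
    collected = []
    while queue and len(collected) < max_nodes:
        node = queue.popleft()
        collected.append(node)
        for neighbor in get_neighbors(node):   # hypothetical data-access callback
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return collected


# Toy in-memory "network" standing in for a real social media site.
toy_network = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob"],
}
visited = crawl(["ann"], lambda user: toy_network.get(user, []))
# visited == ["ann", "bob", "cat", "dan"]
```

Capping the traversal with `max_nodes` is one simple way to keep a crawl bounded on a large network.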


5.1 Data 

In the KDD process, data is represented in a tabular format. Consider the 
example of predicting whether an individual who visits an online book 
seller is going to buy a specific book. This prediction can be performed 
by analyzing the individual’s interests and previous purchase history. For 
instance, when John has spent a lot of money on the site, has bought similar 
books, and visits the site frequently, it is likely for John to buy that specific 
book. John is an example of an instance. Instances are also called points, 
data points, or observations. A dataset consists of one or more instances: 
| Name | Money Spent | Bought Similar | Visits | Will Buy (Class) |
|---|---|---|---|---|
| John | High | Yes | Frequently | ? |
| Mary | High | Yes | Rarely | Yes |




A dataset is represented using a set of features, and an instance is
represented using values assigned to these features. Features are also known
as measurements or attributes. In this example, the features are Name,
Money Spent, Bought Similar, and Visits; feature values for the
first instance are John, High, Yes, and Frequently. Given the feature
values for one instance, one tries to predict its class (or class attribute)
value. In our example, the class attribute is Will Buy, and our class value
prediction for the first instance is Yes. An instance such as John in which
the class attribute value is unknown is called an unlabeled instance.
Similarly, a labeled instance is an instance in which the class attribute
value is known.
Mary in this dataset represents a labeled instance. The class attribute is 
optional in a dataset and is only necessary for prediction purposes. One 
can have a dataset in which no class attribute is present, such as a list of 
customers and their characteristics. 
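
As a small illustration of these terms, the two instances from the example can be held as dictionaries and separated into labeled and unlabeled sets. The representation (a list of dictionaries, with `None` standing for an unknown class value) and the helper name `split_by_label` are assumptions chosen for this sketch.

```python
CLASS_ATTRIBUTE = "Will Buy"

dataset = [
    {"Name": "John", "Money Spent": "High", "Bought Similar": "Yes",
     "Visits": "Frequently", "Will Buy": None},   # class value unknown
    {"Name": "Mary", "Money Spent": "High", "Bought Similar": "Yes",
     "Visits": "Rarely", "Will Buy": "Yes"},      # class value known
]


def split_by_label(instances, class_attribute):
    """Separate instances whose class value is known from those where it is not."""
    labeled = [x for x in instances if x[class_attribute] is not None]
    unlabeled = [x for x in instances if x[class_attribute] is None]
    return labeled, unlabeled


labeled, unlabeled = split_by_label(dataset, CLASS_ATTRIBUTE)
```

Here `labeled` holds Mary's instance and `unlabeled` holds John's, matching the description above.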

There are different types of features based on the characteristics of the 
feature and the values they can take. For instance, Money Spent can be 
represented using numeric values, such as $25. In that case, we have a 
continuous feature, whereas in our example it is a discrete feature, which 
can take a number of ordered values: {High, Normal, Low}.

Different types of features were first introduced by psychologist Stanley
Smith Stevens [1946] as “levels of measurement” in the theory of scales.
He claimed that there are four types of features. For each feature type, there 
exists a set of permissible operations (statistics) using the feature values 
and transformations that are allowed. 




• Nominal (categorical). These features take values that are often
represented as strings. For instance, a customer’s name is a nominal
feature. In general, a few statistics can be computed on nominal
features. Examples are the chi-square statistic ($\chi^2$) and the mode (most
common feature value). For example, one can find the most common
first name among customers. The only possible transformation
on the data is comparison. For example, we can check whether our
customer’s name is John or not. Nominal feature values are often
presented in a set format.
• Ordinal. Ordinal features lay data on an ordinal scale. In other words,
the feature values have an intrinsic order to them. In our example,
Money Spent is an ordinal feature because a High value for Money
Spent is more than a Low one.

• Interval. In interval features, in addition to their intrinsic ordering,
differences are meaningful whereas ratios are meaningless. For interval
features, addition and subtraction are allowed, whereas multiplication
and division are not. Consider two time readings: 6:16 PM
and 3:08 PM. The difference between these two time readings is
meaningful (3 hours and 8 minutes); however, there is no meaning to
$\frac{6{:}16\ \text{PM}}{3{:}08\ \text{PM}}$.

• Ratio. Ratio features, as the name suggests, add the additional
properties of multiplication and division. An individual’s income is an
example of a ratio feature where not only differences and additions
are meaningful but ratios also have meaning (e.g., an individual’s
income can be twice as much as John’s income).
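
The four levels can be illustrated with the operations each one permits. The snippet below is only a sketch: the sample names, the calendar date attached to the two time readings, and the income figures are made-up values used for illustration.

```python
from collections import Counter
from datetime import datetime

# Nominal: only equality checks and counts (e.g., the mode) are meaningful.
names = ["John", "Mary", "John"]
most_common_name = Counter(names).most_common(1)[0][0]

# Ordinal: values can be ranked, but distances between ranks are undefined.
SPENDING_ORDER = {"Low": 0, "Normal": 1, "High": 2}


def spends_more(a, b):
    """True when ordinal value a ranks above ordinal value b."""
    return SPENDING_ORDER[a] > SPENDING_ORDER[b]


# Interval: differences are meaningful, ratios are not.
reading_a = datetime(2024, 1, 1, 18, 16)   # 6:16 PM on an arbitrary date
reading_b = datetime(2024, 1, 1, 15, 8)    # 3:08 PM on the same date
gap = reading_a - reading_b                # a timedelta of 3 hours 8 minutes

# Ratio: multiplication and division are meaningful as well.
income_ratio = 60_000 / 30_000             # one income is twice the other
```

Python's `datetime` type mirrors the interval level in that it defines subtraction between two readings but no division of one reading by another.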

In social media, individuals generate many types of nontabular data, such
as text, voice, or video. These types of data are first converted to tabular
data and then processed using data mining algorithms. For instance, voice
can be converted to feature values using approximation techniques such
as the fast Fourier transform (FFT) and then processed using data mining
algorithms. To convert text into the tabular format, we can use a process
denoted as vectorization. A variety of vectorization methods exist. A
well-known method for vectorization is the vector-space model introduced by
Salton, Wong, and Yang [1975].


Vector Space Model 

In the vector space model, we are given a set of documents $D$. Each
document is a set of words. The goal is to convert these textual documents to
[feature] vectors. We can represent document $i$ with vector $d_i$,

$$d_i = (w_{1,i}, w_{2,i}, \ldots, w_{N,i}), \tag{5.1}$$

where $w_{j,i}$ represents the weight for word $j$ that occurs in document $i$
and $N$ is the number of words used for vectorization. To compute $w_{j,i}$, we
can set it to 1 when the word $j$ exists in document $i$ and 0 when it does
not. We can also set it to the number of times the word $j$ is observed in
document $i$. A more generalized approach is to use the term
frequency-inverse document frequency (TF-IDF) weighting scheme. In the TF-IDF
scheme, $w_{j,i}$ is calculated as

$$w_{j,i} = tf_{j,i} \times idf_j, \tag{5.2}$$

where $tf_{j,i}$ is the frequency of word $j$ in document $i$. $idf_j$ is the
inverse frequency of word $j$ across all documents,

$$idf_j = \log_2 \frac{|D|}{|\{\text{document} \in D \mid j \in \text{document}\}|}, \tag{5.3}$$

which is the logarithm of the total number of documents divided by the
number of documents that contain word $j$. TF-IDF assigns higher weights
to words that are less frequent across documents and, at the same time, have
higher frequencies within the document they are used. This guarantees that
words with high TF-IDF values can be used as representative examples of
the documents they belong to and also, that stop words, such as “the,” which
are common in all documents, are assigned smaller weights.
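
Equations 5.2 and 5.3 translate directly into code. The following sketch assumes whitespace tokenization, lowercasing, and a base-2 logarithm; the function name `tf_idf` and the dictionary-per-document output are choices made for this illustration, not part of the vector space model itself.

```python
import math


def tf_idf(documents):
    """Return one {word: weight} dict per document, with weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = {}
    for words in tokenized:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    weights = []
    for words in tokenized:
        row = {}
        for word in set(words):
            tf = words.count(word)                      # term frequency in this document
            idf = math.log2(n_docs / doc_freq[word])    # Equation 5.3
            row[word] = tf * idf                        # Equation 5.2
        weights.append(row)
    return weights


corpus = ["social media mining", "social media data", "financial market data"]
vectors = tf_idf(corpus)
```

Words absent from a document are simply left out of its dictionary, which corresponds to a weight of 0. A word that occurs in every document receives an idf of $\log_2 1 = 0$, so its weight vanishes regardless of how often it appears.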

Example 5.1. Consider the words “apple” and “orange” that appear 10
and 20 times in document $d_1$. Let $|D| = 20$ and assume the word “apple”
only appears in document $d_1$ and the word “orange” appears in all 20
documents. Then, TF-IDF values for “apple” and “orange” in document
$d_1$ are

$$\text{tf-idf}(\text{``apple''}, d_1) = 10 \times \log_2 \frac{20}{1} = 43.22, \tag{5.4}$$

$$\text{tf-idf}(\text{``orange''}, d_1) = 20 \times \log_2 \frac{20}{20} = 0. \tag{5.5}$$

Example 5.2. Consider the following three documents:

$$d_1 = \text{``social media mining''} \tag{5.6}$$

$$d_2 = \text{``social media data''} \tag{5.7}$$

$$d_3 = \text{``financial market data''} \tag{5.8}$$

The tf values are as follows:

| | social | media | mining | data | financial | market |
|---|---|---|---|---|---|---|
| $d_1$ | 1 | 1 | 1 | 0 | 0 | 0 |
| $d_2$ | 1 | 1 | 0 | 1 | 0 | 0 |
| $d_3$ | 0 | 0 | 0 | 1 | 1 | 1 |

The idf values are

$$idf_{\text{social}} = \log_2(3/2) = 0.584 \tag{5.9}$$

$$idf_{\text{media}} = \log_2(3/2) = 0.584 \tag{5.10}$$

$$idf_{\text{mining}} = \log_2(3/1) = 1.584 \tag{5.11}$$

$$idf_{\text{data}} = \log_2(3/2) = 0.584 \tag{5.12}$$

$$idf_{\text{financial}} = \log_2(3/1) = 1.584 \tag{5.13}$$

$$idf_{\text{market}} = \log_2(3/1) = 1.584. \tag{5.14}$$

The TF-IDF values can be computed by multiplying tf values with the
idf values:

| | social | media | mining | data | financial | market |
|---|---|---|---|---|---|---|
| $d_1$ | 0.584 | 0.584 | 1.584 | 0 | 0 | 0 |
| $d_2$ | 0.584 | 0.584 | 0 | 0.584 | 0 | 0 |
| $d_3$ | 0 | 0 | 0 | 0.584 | 1.584 | 1.584 |


After vectorization, documents are converted to vectors, and common 
data mining algorithms can be applied. However, before that can occur, the 
quality of data needs to be verified. 

5.1.1 Data Quality 

When preparing data for use in data mining algorithms, the following four 
data quality aspects need to be verified: 

1. Noise is the distortion of the data. This distortion needs to be removed 
or its adverse effect alleviated before running data mining algorithms 
because it may adversely affect the performance of the algorithms. 
Many filtering algorithms are effective in combating noise effects. 

2. Outliers are instances that are considerably different from other 
instances in the dataset. Consider an experiment that measures the 
average number of followers of users on Twitter. A celebrity with 
many followers can easily distort the average number of followers per 
individual. Since the celebrities are outliers, they need to be removed 
from the set of individuals to accurately measure the average number 
of followers. Note that in special cases, outliers represent useful 
patterns, and the decision to remove them depends on the context 
of the data mining problem. 

3. Missing Values are feature values that are missing in instances. 
For example, individuals may avoid reporting profile information 
on social media sites, such as their age, location, or hobbies. To 
solve this problem, we can (1) remove instances that have missing 
values, (2) estimate missing values (e.g., replacing them with the 
most common value), or (3) ignore missing values when running 
data mining algorithms. 

4. Duplicate data occur when there are multiple instances with the 
exact same feature values. Duplicate blog posts, duplicate tweets, 
or profiles on social media sites with duplicate information are 
all instances of this phenomenon. Depending on the context, these 
instances can either be removed or kept. For example, when instances 
need to be unique, duplicate instances should be removed. 
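
Three of the four checks above can be sketched on a toy dataset. The 
records, the field layout, and the outlier rule (drop follower counts 
more than 10 times the median) are assumptions made for illustration 
only; the text does not prescribe a particular rule. 

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records: (user_id, followers, location); None = missing value.
records = [
    ("u1", 120, "Phoenix"),
    ("u2", 95, None),
    ("u3", 150, "Tempe"),
    ("u3", 150, "Tempe"),          # exact duplicate of the previous record
    ("u4", 110, "Phoenix"),
    ("u5", 2_000_000, "Phoenix"),  # a celebrity account
]

# Duplicate data: keep the first occurrence of each identical record.
unique = list(dict.fromkeys(records))

# Missing values: replace them with the most common observed value.
observed = [loc for _, _, loc in unique if loc is not None]
most_common_location = Counter(observed).most_common(1)[0][0]
filled = [
    (uid, followers, loc if loc is not None else most_common_location)
    for uid, followers, loc in unique
]

# Outliers: drop follower counts far above the median (factor 10 is arbitrary).
med = median(followers for _, followers, _ in filled)
typical = [record for record in filled if record[1] <= 10 * med]

print("mean with outlier:   ", mean(f for _, f, _ in filled))
print("mean without outlier:", mean(f for _, f, _ in typical))
```

The two printed means differ by several orders of magnitude, which is the 
distortion described in the Twitter example. 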

After these quality checks are performed, the next step is preprocessing 
or transformation to prepare the data for mining. 

5.2 Data Preprocessing 

Often, the data provided for data mining is not immediately ready. Data 
preprocessing (and transformation in Figure 5.1) prepares the data for 
mining. Typical data preprocessing tasks are as follows: 

• Aggregation. This task is performed when multiple features need 
to be combined into a single one or when the scale of the features 
changes. For instance, when storing image dimensions for a social 
media website, one can store image width and height or, more 
compactly, store image area (width × height). Storing image area saves 
storage space and tends to reduce data variance; hence, the data has 
higher resistance to distortion and noise. 

• Discretization. Consider a continuous feature such as money spent 
in our previous example. This feature can be converted into discrete 
values (High, Normal, and Low) by mapping different ranges 
to different discrete values. The process of converting continuous 
features to discrete ones and deciding the continuous range that is 
being assigned to a discrete value is called discretization. 

• Feature Selection. Often, not all features gathered are useful. Some 
may be irrelevant, or there may be a lack of computational power 
to make use of all the features, among many other reasons. In these 
cases, a subset of features is selected that could ideally enhance the 
performance of the selected data mining algorithm. In our example, 
the customer’s name is irrelevant to the value of the class 
attribute and to the task of predicting whether the individual will buy 
the given book or not. 

• Feature Extraction. In contrast to feature selection, feature extraction 
converts the current set of features to a new set of features that can 
perform the data mining task better. A transformation is performed 
on the data, and a new set of features is extracted. The example we 
provided for aggregation is also an example of feature extraction 
where a new feature (area) is constructed from two other features 
(width and height). 

• Sampling. Often, processing the whole dataset is expensive. With the 
massive growth of social media, processing large streams of data is 
nearly impossible. This motivates the need for sampling. In sampling, 
a small random subset of instances is selected and processed instead 
of the whole data. The selection process should guarantee that the 
sample is representative of the distribution that governs the data, 
thereby ensuring that results obtained on the sample are close to ones 
obtained on the whole dataset. The following are three major sampling 
techniques: 

1. Random sampling. In random sampling, instances are selected 
uniformly from the dataset. In other words, in a dataset of size 
$n$, all instances have equal probability $1/n$ of being selected. Note 
that other probability distributions can also be used to sample the 
dataset, and the distribution can be different from uniform. 

2. Sampling with or without replacement. In sampling with 
replacement, an instance can be selected multiple times in the 
sample. In sampling without replacement, instances are removed 
from the selection pool once selected. 

3. Stratified sampling. In stratified sampling, the dataset is first 
partitioned into multiple bins; then a fixed number of instances 
are selected from each bin using random sampling. This technique 
is particularly useful when the dataset does not have a uniform 
distribution for class attribute values (i.e., class imbalance). For 
instance, consider a set of 10 females and 5 males. A sample of 
5 females and 5 males can be selected using stratified sampling 
from this set. 

In social media, a large amount of information is represented in 
network form. These networks can be sampled by selecting a subset 
of their nodes and edges. These nodes and edges can be selected 
using the aforementioned sampling methods. We can also sample 
these networks by starting with a small set of nodes (seed nodes) and 
sampling 

(a) the connected components they belong to; 

(b) the set of nodes (and edges) connected to them directly; or 

(c) the set of nodes and edges that are within n-hop distance from 
them. 
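
The sampling techniques listed above can be sketched with the standard 
library. This is an illustration under stated assumptions: the helper 
names `stratified_sample` and `n_hop_sample`, the fixed random seed, and 
the small path-shaped network are all hypothetical. The population mirrors 
the 10 females and 5 males of the stratified sampling example. 

```python
import random
from collections import deque

rng = random.Random(0)  # fixed seed so the sketch is repeatable

population = [("f", i) for i in range(10)] + [("m", i) for i in range(5)]

# Uniform random sampling without and with replacement.
without_replacement = rng.sample(population, k=5)   # no instance repeats
with_replacement = rng.choices(population, k=5)     # repeats are possible


def stratified_sample(instances, key, per_bin, rng):
    """Partition instances into bins by key, then draw per_bin from each bin."""
    bins = {}
    for instance in instances:
        bins.setdefault(key(instance), []).append(instance)
    sample = []
    for members in bins.values():
        sample.extend(rng.sample(members, k=per_bin))
    return sample


balanced = stratified_sample(population, key=lambda inst: inst[0],
                             per_bin=5, rng=rng)


def n_hop_sample(adjacency, seeds, n):
    """Return all nodes within n hops of the seed nodes (breadth-first search)."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == n:
            continue  # do not expand beyond n hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# Hypothetical network shaped like a path: a - b - c - d - e.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
print(sorted(n_hop_sample(adjacency, seeds=["a"], n=2)))
```

Setting n = 1 corresponds to option (b), the nodes directly connected to 
the seeds, while a sufficiently large n recovers option (a), the connected 
components the seeds belong to. 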

After preprocessing is performed, the data is ready to be mined. Next, 
we discuss two general categories of data mining algorithms and how each 
can be evaluated. 


5.3 Data Mining Algorithms 

Data mining algorithms can be divided into several categories. Here, we 
discuss two well-established categories: supervised learning and 
unsupervised learning. In supervised learning, the class attribute exists, 
and the task is to predict the class attribute value. Our previous example 
of predicting the class attribute “will buy” is an example of supervised 
learning. In unsupervised learning, the dataset has no class attribute, and 
our task is to find similar instances in the dataset and group them. By 
grouping these similar instances, one can find significant patterns in a 
dataset. For example, unsupervised learning can be used to identify events 
on Twitter, because the frequency of tweeting is different for various 
events. By using unsupervised learning, tweets can be grouped based on the 
times at which they appear, which helps identify the tweets’ corresponding 
real-world events. Other categories of data mining algorithms exist; 
interested readers can refer to the bibliographic notes for pointers to 
these categories. 

5.4 Supervised Learning 

The first category of algorithms, supervised learning algorithms, are those 
for which the class attribute values for the dataset are known before running 
the algorithm. This data is called labeled data or training data. Instances in 
this set are tuples in the format (x, y), where x is a vector and y is the class 
attribute, commonly a scalar. Supervised learning builds a model that maps 
x to y. Roughly, our task is to find a mapping m(.) such that m(x) = y. We 
are also given an unlabeled dataset or test dataset, in which instances are in 
the form (x, ?) and y values are unknown. Given m(.) learned from training 
data and x of an unlabeled instance, we can compute m(x), the result of 
which is the prediction of the label for the unlabeled instance. 

Figure 5.2. Supervised Learning. 

Consider the task of detecting spam emails. A set of emails is given where 
users have manually identified spam versus non-spam (training data). Our 
task is to use a set of features such as words in the email (x) to identify 
the spam/non-spam status (y) of unlabeled emails (test data). In this case, 
y ∈ {spam, non-spam}. 

Supervised learning can be divided into classification and regression. 
When the class attribute is discrete, it is called classification; when the 
class attribute is continuous, it is regression. We introduce classification 
methods such as decision tree learning, naive Bayes classifier, k-nearest 
neighbor classifier, and classification with network information, and 
regression methods such as linear regression and logistic regression. We also 
introduce how supervised learning algorithms are evaluated. Before we 
delve into supervised learning techniques, we briefly discuss the systematic 
process of a supervised learning algorithm. 

This process is depicted in Figure 5.2. It starts with a training set (i.e., 
labeled data) where both features and labels (class attribute values) are 
known. A supervised learning algorithm is run on the training set in a 
process known as induction. In the induction process, the model is generated. 
The model maps the feature values to the class attribute values. The model 
is used on a test set in which the class attribute value is unknown to predict 
these unknown class attribute values (deduction process). 

Table 5.1. A Sample Dataset. In this dataset, features are 
characteristics of individuals on Twitter, and the class attribute 
denotes whether they are influential or not 

ID   Celebrity   Verified Account   # Followers   Influential?
 1   Yes         No                 1.25 M        No
 2   No          Yes                1 M           No
 3   No          Yes                600 K         No
 4   Yes         Unknown            2.2 M         No
 5   No          No                 850 K         Yes
 6   No          Yes                750 K         No
 7   No          No                 900 K         Yes
 8   No          No                 700 K         No
 9   Yes         Yes                1.2 M         No
10   No          Unknown            950 K         Yes


5.4.1 Decision Tree Learning 

Consider the dataset shown in Table 5.1. The last attribute represents the 
class attribute, and the other attributes represent the features. In decision tree 
classification, a decision tree is learned from the training dataset, and that 
tree is later used to predict the class attribute value for instances in the test 
dataset. As an example, two learned decision trees from the dataset shown 
in Table 5.1 are provided in Figure 5.3. As shown in this figure, multiple 
decision trees can be learned from the same dataset, and these decision trees 
can both correctly predict the class attribute values for all instances in the 
dataset. Construction of decision trees is based on heuristics, as different 
heuristics generate different decision trees from the same dataset. 


Figure 5.3. Decision Trees Learned from Data Provided in Table 5.1: 
(a) Learned Decision Tree 1; (b) Learned Decision Tree 2. 

Decision trees classify examples based on their feature values. Each 
nonleaf node in a decision tree represents a feature, and each branch 
represents a value that the feature can take. Instances are classified by 
following a path that starts at the root node and ends at a leaf by following 
branches based on instance feature values. The value of the leaf determines 
the class attribute value predicted for the instance (see Figure 5.3). 

Decision trees are constructed recursively from training data using a 
top-down greedy approach in which features are sequentially selected. In 
Figure 5.3(a), the feature selected for the root node is Celebrity. After 
selecting a feature for each node, based on its values, different branches 
are created: For Figure 5.3(a), since the Celebrity feature can only take 
either Yes or No, two branches are created: one labeled Yes and one labeled 
No. The training set is then partitioned into subsets based on the feature 
values, each of which falls under the respective feature value branch; the 
process is continued for these subsets and other nodes. In Figure 5.3(a), 
instances 1, 4, and 9 from Table 5.1 represent the subset that falls under the 
Celebrity=Yes branch, and the other instances represent the subset that 
falls under the Celebrity=No branch. 

When selecting features, we prefer features that partition the set of 
instances into subsets that are more pure. A pure subset has instances 
that all have the same class attribute value. In Figure 5.3(a), the instances 
that fall under the left branch of the root node (Celebrity=Yes) form 
a pure subset in which all instances have the same class attribute value 
Influential?=No. When reaching pure subsets under a branch, the decision 
tree construction process no longer partitions the subset, creates a leaf 
under the branch, and assigns the class attribute value for subset instances 
as the leaf’s predicted class attribute value. In Figure 5.3(a), the instances 
that fall under the right branch of the root node form an impure dataset; 
therefore, further branching is required to reach pure subsets. Purity of 
subsets can be determined with different measures. A common measure of 
purity is entropy. Over a subset of training instances, $T$, with a binary 
class attribute (values $\in \{+, -\}$), the entropy of $T$ is defined as 

$$ \text{entropy}(T) = -p_+ \log p_+ - p_- \log p_-, \tag{5.15} $$

where $p_+$ is the proportion of instances with + class attribute value in $T$ 
and $p_-$ is the proportion of instances with - class attribute value. Here, 
log denotes the base-2 logarithm, so the entropy of a binary class attribute 
lies between 0 and 1. 


Example 5.3. Assume that there is a subset $T$, containing 10 instances. 
Seven instances have a positive class attribute value, and three instances 
have a negative class attribute value [7+, 3-]. The entropy for subset $T$ is 

$$ \text{entropy}(T) = -\frac{7}{10} \log \frac{7}{10} - \frac{3}{10} \log \frac{3}{10} = 0.881. \tag{5.16} $$

Note that if the number of positive and negative instances in the set are 
equal ($p_+ = p_- = 0.5$), then the entropy is 1. 

In a pure subset, all instances have the same class attribute value and the 
entropy is 0. If the subset being measured contains an unequal number of 
positive and negative instances, the entropy is between 0 and 1. 
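
Equation 5.15 is short enough to check numerically. The sketch below 
assumes a base-2 logarithm and treats 0 log 0 as 0; the function name 
`entropy` and its count-based signature are illustrative choices. It 
evaluates Example 5.3 and the two subsets created by splitting Table 5.1 
on the Celebrity feature. 

```python
import math


def entropy(positives, negatives):
    """Entropy of a subset with the given class counts (base-2 logarithm)."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:  # a class with no instances contributes 0
            p = count / total
            result -= p * math.log2(p)
    return result


print(round(entropy(7, 3), 3))   # Example 5.3: [7+, 3-]
print(entropy(5, 5))             # equal numbers of + and -
print(entropy(0, 3))             # Celebrity=Yes in Table 5.1: [0+, 3-], pure
print(round(entropy(3, 4), 3))   # Celebrity=No in Table 5.1: [3+, 4-], impure
```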


5.4.2 Naive Bayes Classifier 


Among many methods that use the Bayes theorem, the naive Bayes classifier 
(NBC) is the simplest. Given two random variables $X$ and $Y$, Bayes theorem 
states that 

$$ P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}. \tag{5.17} $$


In NBC, $Y$ represents the class variable and $X$ represents the instance 
features. Let $X$ be $(x_1, x_2, x_3, \ldots, x_m)$, where $x_i$ represents the 
value of feature $i$. Let $\{y_1, y_2, \ldots, y_n\}$ represent the values the 
class attribute $Y$ can take. Then, the class attribute value of instance $X$ 
can be calculated by measuring 

$$ \arg\max_{y_i} P(y_i \mid X). \tag{5.18} $$


Based on the Bayes theorem, 

$$ P(y_i \mid X) = \frac{P(X \mid y_i)\,P(y_i)}{P(X)}. \tag{5.19} $$


Note that $P(X)$ is constant and independent of $y_i$, so we can ignore the 
denominator of Equation 5.19 when maximizing Equation 5.18. The NBC 
also assumes conditional independence to make the calculations easier; that 
is, given the class attribute value, other feature attributes become 
conditionally independent. This assumption, though unrealistic, performs 
well in practice and greatly simplifies calculation: 

$$ P(X \mid y_i) = \prod_{j=1}^{m} P(x_j \mid y_i). \tag{5.20} $$

Substituting $P(X \mid y_i)$ from Equation 5.20 in Equation 5.19, we get 

$$ P(y_i \mid X) = \frac{\left(\prod_{j=1}^{m} P(x_j \mid y_i)\right) P(y_i)}{P(X)}. \tag{5.21} $$

We clarify how the naive Bayes classifier works with an example. 

Table 5.2. Naive Bayes Classifier (NBC) Toy Dataset 

No.   Outlook (O)   Temperature (T)   Humidity (H)   Play Golf (PG)
1     sunny         hot               high           N
2     sunny         mild              high           N
3     overcast      hot               high           Y
4     rain          mild              high           Y
5     sunny         cool              normal         Y
6     rain          cool              normal         N
7     overcast      cool              normal         Y
8     sunny         mild              high           ?
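
Equation 5.21 can be evaluated by counting. The following is a minimal 
sketch on the toy dataset of Table 5.2, classifying instance 8. It assumes 
plain relative frequencies with no smoothing, and it drops the common 
factor 1/P(X) because it does not affect the arg max; the function name 
`nbc_scores` is hypothetical. Exact fractions are used so the scores can 
be compared with a hand calculation. 

```python
from collections import Counter
from fractions import Fraction

# Table 5.2, instances 1-7: (outlook, temperature, humidity) -> play golf.
training = [
    (("sunny", "hot", "high"), "N"),
    (("sunny", "mild", "high"), "N"),
    (("overcast", "hot", "high"), "Y"),
    (("rain", "mild", "high"), "Y"),
    (("sunny", "cool", "normal"), "Y"),
    (("rain", "cool", "normal"), "N"),
    (("overcast", "cool", "normal"), "Y"),
]


def nbc_scores(training, instance):
    """Return P(X | y) * P(y) for each class y, estimated from counts."""
    class_counts = Counter(label for _, label in training)
    scores = {}
    for label, count in class_counts.items():
        score = Fraction(count, len(training))  # prior P(y)
        for j, value in enumerate(instance):
            matches = sum(
                1 for features, y in training
                if y == label and features[j] == value
            )
            score *= Fraction(matches, count)   # P(x_j | y)
        scores[label] = score
    return scores


scores = nbc_scores(training, ("sunny", "mild", "high"))  # instance 8
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```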


Example 5.4. Consider the dataset in Table 5.2. 

We predict the label for instance 8 ($i_8$) using the naive Bayes classifier 
and the given dataset. We have 

$$
\begin{aligned}
P(PG = Y \mid i_8) &= \frac{P(i_8 \mid PG = Y)\,P(PG = Y)}{P(i_8)} \\
&= P(O = \text{sunny}, T = \text{mild}, H = \text{high} \mid PG = Y) \times \frac{P(PG = Y)}{P(i_8)} \\
&= P(O = \text{sunny} \mid PG = Y) \times P(T = \text{mild} \mid PG = Y) \\
&\quad \times P(H = \text{high} \mid PG = Y) \times \frac{P(PG = Y)}{P(i_8)} \\
&= \frac{1}{4} \times \frac{1}{4} \times \frac{2}{4} \times \frac{4/7}{P(i_8)} = \frac{1}{56\,P(i_8)}.
\end{aligned}
\tag{5.22}
$$

Similarly, 

$$
\begin{aligned}
P(PG = N \mid i_8) &= \frac{P(i_8 \mid PG = N)\,P(PG = N)}{P(i_8)} \\
&= P(O = \text{sunny}, T = \text{mild}, H = \text{high} \mid PG = N) \times \frac{P(PG = N)}{P(i_8)} \\
&= P(O = \text{sunny} \mid PG = N) \times P(T = \text{mild} \mid PG = N) \\
&\quad \times P(H = \text{high} \mid PG = N) \times \frac{P(PG = N)}{P(i_8)} \\
&= \frac{2}{3} \times \frac{1}{3} \times \frac{2}{3} \times \frac{3/7}{P(i_8)} = \frac{4}{63\,P(i_8)}.
\end{aligned}
\tag{5.23}
$$

Since $\frac{4}{63\,P(i_8)} > \frac{1}{56\,P(i_8)}$, for instance $i_8$, and based on 
NBC calculations, we have Play Golf = N. 

Algorithm 5.1 k-Nearest Neighbor Classifier 

Require: Instance i, a dataset of real-value attributes, k (number of 
neighbors), distance measure d 

1: return Class label for instance i 
2: Compute k nearest neighbors of instance i based on distance measure d. 
3: l = the majority class label among neighbors of instance i. If more 
   than one majority label, select one randomly. 
4: Classify instance i as class l 

5.4.3 Nearest Neighbor Classifier 

As the name suggests, k-nearest neighbor or kNN uses the k nearest 
instances, called neighbors, to perform classification. The instance being 
classified is assigned the label (class attribute value) that the majority of 
its k neighbors are assigned. The algorithm is outlined in Algorithm 5.1. 
When k = 1, the closest neighbor’s label is used as the predicted label for 
the instance being classified. To determine the neighbors of an instance, we 
need to measure its distance to all other instances based on some distance 
metric. Commonly, Euclidean distance is employed; however, for higher 
dimensional spaces, Euclidean distance becomes less meaningful and other 
distance measures can be used. 

Example 5.5. Consider the example depicted in Figure 5.4. As shown, 
depending on the value of k, different labels can be predicted for the circle. 
In our example, k = 5 and k = 9 generate different labels for the instance 
(triangle and square, respectively). 

As shown in our example, an important issue with the k-nearest neighbor 
algorithm is the choice of k. The choice of k can easily change the label of 
the instance being predicted. In general, we are interested in a value of k 
that maximizes the performance of the learning algorithm. 
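
The k-nearest neighbor procedure of Algorithm 5.1 can be sketched as 
follows. The dataset, the query point, and the function name 
`knn_classify` are hypothetical and are not the points of Figure 5.4; 
Euclidean distance (`math.dist`, Python 3.8 or later) is assumed as the 
default distance measure. The example is built so that the predicted label 
changes with k, as in Example 5.5. 

```python
import math
import random
from collections import Counter


def knn_classify(instance, dataset, k, distance=math.dist, rng=random):
    """Classify instance by majority vote among its k nearest neighbors.

    dataset is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(dataset, key=lambda pair: distance(instance, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    winners = [label for label, count in votes.items() if count == top]
    return rng.choice(winners)  # ties are broken randomly


# Hypothetical two-dimensional instances.
dataset = [
    ((1.0, 1.0), "triangle"), ((1.2, 0.8), "triangle"), ((0.8, 1.1), "triangle"),
    ((3.0, 3.0), "square"), ((3.2, 2.9), "square"),
    ((2.9, 3.3), "square"), ((3.1, 3.1), "square"),
]
query = (1.5, 1.5)

print(knn_classify(query, dataset, k=3))  # the three closest are triangles
print(knn_classify(query, dataset, k=7))  # all instances vote: 4 squares vs 3
```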


5.4.4 Classification with Network Information 

Consider a friendship network on social media and a product being marketed 
to this network. The product seller wants to know who the potential buyers 
are for this product. Assume we are given the network with the list of 
individuals who decided to buy or not buy the product. Our goal is to predict 
the decision for the undecided individuals. This problem can be formulated 
as a classification problem based on features gathered from individuals. 
However, in this case, we have additional friendship information that may be 
helpful in building more accurate classification models. This is an example 
of classification with network information. 

Figure 5.4. k-Nearest Neighbor Example. In this figure, our goal is to predict 
the label for the instance shown using a circle. When k = 5, the predicted 
label is ▲ and when k = 9 the predicted label is ■. 

Assume we are not given any profile information, but only connections 
and class labels (i.e., the individual bought/will not buy). By using the rows 
of the adjacency matrix of the friendship network for each node as features 
and the decision (e.g., buy/not buy) as a class label, we can predict the label 
for any unlabeled node using its connections; that is, its row in the adjacency 
matrix. Let $P(y_i = 1 \mid N(v_i))$ denote the probability of node $v_i$ having 
class attribute value 1 given its neighbors. Individuals’ decisions are often 
highly influenced by their immediate neighbors. Thus, we can approximate 
$P(y_i = 1)$ using the neighbors of the individual by assuming that 

$$ P(y_i = 1) \approx P(y_i = 1 \mid N(v_i)). \tag{5.24} $$




We can estimate $P(y_i = 1 \mid N(v_i))$ via different approaches. The 
weighted-vote relational-neighbor (wvRN) classifier is one such approach. 
It estimates $P(y_i = 1 \mid N(v_i))$ as 

$$ P(y_i = 1 \mid N(v_i)) = \frac{1}{|N(v_i)|} \sum_{v_j \in N(v_i)} P(y_j = 1 \mid N(v_j)). \tag{5.25} $$

In other words, the probability of node $v_i$ having class attribute value 
1 is the average probability of its neighbors having this class attribute 
value. Note that $P(y_i = 1 \mid N(v_i))$ is only calculated for $v_i$’s that are 
unlabeled. For node $v_k$, which is labeled 1, $P(y_k = 1 \mid N(v_k)) = 1$ and the 
probability is never estimated. Similarly, if $v_k$ will not buy the product, 
$P(y_k = 0 \mid N(v_k)) = 1$. Since the probability of a node having class attribute 
value 1 depends on the probability of its neighbors having the same value, 
the probability of the node is affected if the probabilities of its neighbors 
change. Thus, the order of updating nodes can affect the estimated 
probabilities. In practice, one follows an order sequence for estimating node 
probabilities. Starting with an initial probability estimate for all unlabeled 
nodes and following this order, we estimate probabilities until probabilities 
no longer change (i.e., converge). We can assume the initial probability 
estimate to be $P(y_i = 1 \mid N(v_i)) = 0.5$ for all unlabeled nodes. We show 
how the wvRN classifier learns probabilities using the following example. 

Figure 5.5. Weighted-Vote Relational-Neighbor (wvRN) Example. Labeled nodes 
have their class attribute values next to them. The goal is to predict labels 
for other nodes in the network. 

Example 5.6. Consider the network given in Figure 5.5. Labeled nodes 
have their class attribute values next to them. Therefore, 

$$ P(y_1 = 1 \mid N(v_1)) = 1, \tag{5.26} $$

$$ P(y_2 = 1 \mid N(v_2)) = 1, \tag{5.27} $$

$$ P(y_5 = 1 \mid N(v_5)) = 0. \tag{5.28} $$

We have three unlabeled nodes $\{v_3, v_4, v_6\}$. We choose their natural order 
to update their probabilities. Thus, we start with $v_3$: 

$$
\begin{aligned}
P(y_3 \mid N(v_3)) &= \frac{1}{|N(v_3)|} \sum_{v_j \in N(v_3)} P(y_j = 1 \mid N(v_j)) \\
&= \frac{1}{3}\bigl(P(y_1 = 1 \mid N(v_1)) + P(y_2 = 1 \mid N(v_2)) + P(y_5 = 1 \mid N(v_5))\bigr) \\
&= \frac{1}{3}(1 + 1 + 0) = 0.67.
\end{aligned}
\tag{5.29}
$$

$P(y_3 \mid N(v_3))$ does not need to be computed again because its neighbors 
are all labeled (thus, this probability estimation has converged). Similarly, 

$$ P(y_4 \mid N(v_4)) = \frac{1}{2}(1 + 0.5) = 0.75, \tag{5.30} $$

$$ P(y_6 \mid N(v_6)) = \frac{1}{2}(0.75 + 0) = 0.38. \tag{5.31} $$

We need to recompute both $P(y_4 \mid N(v_4))$ and $P(y_6 \mid N(v_6))$ until 
convergence. Let $P^{(t)}(y_i \mid N(v_i))$ denote the estimated probability after 
$t$ computations. Then, 

$$ P^{(1)}(y_4 \mid N(v_4)) = \frac{1}{2}(1 + 0.38) = 0.69, \tag{5.32} $$

$$ P^{(1)}(y_6 \mid N(v_6)) = \frac{1}{2}(0.69 + 0) = 0.35, \tag{5.33} $$

$$ P^{(2)}(y_4 \mid N(v_4)) = \frac{1}{2}(1 + 0.35) = 0.68, \tag{5.34} $$

$$ P^{(2)}(y_6 \mid N(v_6)) = \frac{1}{2}(0.68 + 0) = 0.34, \tag{5.35} $$

$$ P^{(3)}(y_4 \mid N(v_4)) = \frac{1}{2}(1 + 0.34) = 0.67, \tag{5.36} $$

$$ P^{(3)}(y_6 \mid N(v_6)) = \frac{1}{2}(0.67 + 0) = 0.34, \tag{5.37} $$

$$ P^{(4)}(y_4 \mid N(v_4)) = \frac{1}{2}(1 + 0.34) = 0.67, \tag{5.38} $$

$$ P^{(4)}(y_6 \mid N(v_6)) = \frac{1}{2}(0.67 + 0) = 0.34. \tag{5.39} $$

After four iterations, both probabilities converge. So, from these 
probabilities (Equations 5.29, 5.38, and 5.39), we can tell that nodes $v_3$ 
and $v_4$ will likely have class attribute value 1 and node $v_6$ will likely 
have class attribute value 0. 
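
The iterative estimation in Example 5.6 can be sketched in code. The 
neighbor lists below are an assumption reconstructed from the equations 
rather than from Figure 5.5: the equations show only that $v_3$ is adjacent 
to $v_1$, $v_2$, and $v_5$, that $v_4$ is adjacent to one node labeled 1 (taken 
here to be $v_1$) and to $v_6$, and that $v_6$ is adjacent to $v_4$ and $v_5$. 
The function name `wvrn` and the convergence tolerance are also illustrative. 

```python
def wvrn(neighbors, labels, order, initial=0.5, tolerance=1e-6, max_rounds=1000):
    """Estimate P(y = 1) for unlabeled nodes by repeated neighbor averaging.

    neighbors: {unlabeled node: list of adjacent nodes}
    labels:    {labeled node: 0 or 1}; these probabilities stay fixed
    order:     sequence in which unlabeled nodes are updated each round
    """
    prob = {node: initial for node in neighbors}
    prob.update({node: float(label) for node, label in labels.items()})
    for _ in range(max_rounds):
        largest_change = 0.0
        for node in order:
            # Equation 5.25: average over the node's neighbors, using the
            # most recent estimates (updated values are used immediately).
            new_value = sum(prob[n] for n in neighbors[node]) / len(neighbors[node])
            largest_change = max(largest_change, abs(new_value - prob[node]))
            prob[node] = new_value
        if largest_change < tolerance:
            break
    return prob


# Assumed adjacency for the unlabeled nodes (see the note above).
neighbors = {3: [1, 2, 5], 4: [1, 6], 6: [4, 5]}
labels = {1: 1, 2: 1, 5: 0}

prob = wvrn(neighbors, labels, order=[3, 4, 6])
print({node: round(prob[node], 2) for node in (3, 4, 6)})
```

Without intermediate rounding, the estimates approach 2/3 for $v_3$ and 
$v_4$ and 1/3 for $v_6$, which matches the example up to its two-decimal 
rounding at each step. 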


5.4.5 Regression 

In classification, class attribute values are discrete. In regression, class 
attribute values are real numbers. For instance, we wish to predict the 
stock market value (class attribute) of a company given information about 
the company (features). The stock market value is continuous; therefore, 
regression must be used to predict it. The input to the regression method is 
a dataset where attributes are represented using $x_1, x_2, \ldots, x_m$ (also 
known as regressors) and class attribute is represented using $Y$ (also known 
as the dependent variable), where the class attribute is a real number. We 
want to find the relation between $Y$ and the vector $X = (x_1, x_2, \ldots, x_m)$. 
We discuss two basic regression techniques: linear regression and logistic 
regression. 

Linear Regression 

In linear regression, we assume that the class attribute $Y$ has a linear 
relation with the regressors (feature set) $X$ by considering a linear error 
$\epsilon$. In other words, 

$$ Y = XW + \epsilon, \tag{5.40} $$

where $W$ represents the vector of regression coefficients. The problem of 
regression can be solved by estimating $W$ using the training dataset and 
its labels $Y$ such that fitting error $\epsilon$ is minimized. A variety of 
methods have been introduced to solve the linear regression problem, most of 
which use least squares or maximum-likelihood estimation. We employ the least 
squares technique here. Interested readers can refer to the bibliographic 
notes for more detailed analyses. In the least squares method, we find $W$ 
using regressors $X$ and labels $Y$ such that the square of fitting error 
$\epsilon$ is minimized: 

$$ \epsilon^2 = \|\epsilon\|^2 = \|Y - XW\|^2. \tag{5.41} $$

To minimize $\epsilon^2$, we compute the gradient and set it to zero to find 
the optimal $W$: 

$$ \frac{\partial \|Y - XW\|^2}{\partial W} = 0. \tag{5.42} $$


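We know that for any vector $X$, $\|X\|^2 = X^T X$; therefore, 

$$
\begin{aligned}
\frac{\partial \|Y - XW\|^2}{\partial W}
&= \frac{\partial (Y - XW)^T (Y - XW)}{\partial W} \\
&= \frac{\partial (Y^T - W^T X^T)(Y - XW)}{\partial W} \\
&= \frac{\partial (Y^T Y - Y^T X W - W^T X^T Y + W^T X^T X W)}{\partial W} \\
&= -2 X^T Y + 2 X^T X W = 0.
\end{aligned}
\tag{5.43}
$$

Therefore, 

$$ X^T Y = X^T X W. \tag{5.44} $$

When $X^T X$ is invertible (which holds when the columns of $X$ are linearly 
independent), by multiplying both sides by $(X^T X)^{-1}$, we get 

$$ W = (X^T X)^{-1} X^T Y. \tag{5.45} $$

Alternatively, one can compute the singular value decomposition (SVD) 
of $X = U \Sigma V^T$: 

$$
\begin{aligned}
W &= (X^T X)^{-1} X^T Y \\
&= (V \Sigma U^T U \Sigma V^T)^{-1} V \Sigma U^T Y \\
&= (V \Sigma^2 V^T)^{-1} V \Sigma U^T Y \\
&= V \Sigma^{-2} V^T V \Sigma U^T Y \\
&= V \Sigma^{-1} U^T Y,
\end{aligned}
\tag{5.46}
$$

and since we can have zero singular values, 

$$ W = V \Sigma^{+} U^T Y, \tag{5.47} $$

where $\Sigma^{+}$ is the pseudoinverse of $\Sigma$, that is, the inverse of 
the submatrix of $\Sigma$ with nonzero singular values. 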
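
The normal equations (Equation 5.44) can be solved directly for small 
problems. The sketch below uses only the standard library and solves 
$X^T X W = X^T Y$ by Gaussian elimination with partial pivoting; the helper 
names `solve` and `least_squares` and the toy data are assumptions made for 
illustration. It raises an error when $X^T X$ is singular, which is the case 
the SVD form in Equation 5.47 is meant to handle. 

```python
def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def least_squares(x, y):
    """Return W minimizing ||Y - XW||^2 via the normal equations."""
    cols = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(cols)]
           for i in range(cols)]
    xty = [sum(row[i] * target for row, target in zip(x, y))
           for i in range(cols)]
    return solve(xtx, xty)


# Hypothetical data generated from y = 1 + 2x; the constant 1 in each row
# of X models the intercept.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = least_squares(x, y)
print(w)  # intercept and slope
```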


Logistic Regression 

Logistic regression provides a probabilistic view of regression. For 
simplicity, let us assume that the class attribute can only take values of 
0 and 1. Formally, logistic regression finds probability $p$ such that 

$$ P(Y = 1 \mid X) = p, \tag{5.48} $$

where $X$ is the vector of features and $Y$ is the class attribute. We can use 
linear regression to approximate $p$. In other words, we can assume that 
probability $p$ depends on $X$; that is, 

$$ p = \beta X, \tag{5.49} $$

where $\beta$ is a vector of coefficients. Unfortunately, $\beta X$ can take 
unbounded values because $X$ can take on any value and there are no 
constraints on how $\beta$’s are chosen. However, probability $p$ must be in 
range $[0, 1]$. Since $\beta X$ is unbounded, we can perform a transformation 
$g(.)$ on $p$ such that it also becomes unbounded. Then, we can fit $g(p)$ to 
$\beta X$. One such transformation $g(.)$ for $p$ is 

$$ g(p) = \ln \frac{p}{1 - p}, \tag{5.50} $$

which for any $p$ between $[0, 1]$ generates a value in range 
$[-\infty, +\infty]$. The function $g(.)$ is known as the logit function. The 
transformed $p$ can be approximated using a linear function of feature 
vector $X$, 

$$ g(p) = \beta X. \tag{5.51} $$

Combining Equations 5.50 and 5.51 and solving for $p$, we get 

$$ p = \frac{e^{\beta X}}{e^{\beta X} + 1} = \frac{1}{e^{-\beta X} + 1}. \tag{5.52} $$

This function is known as the logistic function and is plotted in Figure 5.6. 
An interesting property of this function is that, for any real value (negative 
to positive infinity), it will generate values between 0 and 1. In other words, 
it acts as a probability function. 

Figure 5.6. Logistic Function. 
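
The relationship between the logit function (Equation 5.50) and the 
logistic function (Equation 5.52) can be checked numerically. In this 
sketch the coefficient vector `beta` is hypothetical rather than fitted, 
the first feature is fixed to 1 to act as an intercept, and the 0.5 
threshold in `predict` is the usual choice for a two-valued class 
attribute. 

```python
import math


def logit(p):
    """Equation 5.50: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Equation 5.52: maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(beta, x, threshold=0.5):
    """Return (p, label), where p = logistic(beta . x)."""
    z = sum(b * value for b, value in zip(beta, x))
    p = logistic(z)
    return p, int(p > threshold)


beta = [-1.0, 2.0]  # hypothetical coefficients, not learned from data

print(logistic(0.0))               # the midpoint of the curve
print(logit(logistic(1.7)))        # the two functions are inverses
print(predict(beta, [1.0, 1.5]))   # beta . x = 2
print(predict(beta, [1.0, 0.0]))   # beta . x = -1
```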

Our task is to find $\beta$’s such that $P(Y \mid X)$ is maximized. Unlike linear 
regression models, there is no closed form solution to this problem, and it 
is usually solved using iterative maximum likelihood methods (see 
Bibliographic Notes). 

After $\beta$’s are found, similar to the naive Bayes classifier (NBC), we 
compute the probability $P(Y \mid X)$ using Equation 5.52. In a situation where 
the class attribute takes two values, when this probability is larger than 0.5, 
the class attribute is predicted 1; otherwise, 0 is predicted. 

5.4.6 Supervised Learning Evaluation 

Supervised learning algorithms often employ a training-testing framework 
in which a training dataset (i.e., the labels are known) is used to train a 
model and then the model is evaluated on a test dataset. The performance 
of the supervised learning algorithm is measured by how accurate it is in 
predicting the correct labels of the test dataset. Since the correct labels of 
the test dataset are unknown, in practice, the training set is divided into two 
parts, one used for training and the other used for testing. Unlike the original 
test set, for this test set the labels are known. Therefore, when testing, the 
labels from this test set are removed. After these labels are predicted using 
the model, the predicted labels are compared with the masked labels (ground 
truth). This measures how well the trained model is generalized to predict 
class attributes. One way of dividing the training set into train/test sets is to 
divide the training set into k equally sized partitions, or folds, and then use 
all folds but one to train, with the one left out for testing. This technique is 
called leave-one-out training. Another way is to divide the training set into 
k equally sized sets and then run the algorithm k times. In round i, we use all 
folds but fold i for training and fold i for testing. The average performance 
of the algorithm over k rounds measures the generalization accuracy of the 
algorithm. This robust technique is known as k-fold cross validation. 

To compare the masked labels with the predicted labels, depending on 
the type of supervised learning algorithm, different evaluation techniques 
can be used. In classification, the class attribute is discrete so the values it 
can take are limited. This allows us to use accuracy to evaluate the classifier. 
The accuracy is the fraction of labels that are predicted correctly. Let n be 
the size of the test dataset and let c be the number of instances from the 
test dataset for which the labels were predicted correctly using the trained 
model. Then the accuracy of this model is 

$$ \text{accuracy} = \frac{c}{n}. \tag{5.53} $$
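
Accuracy and k-fold cross validation can be combined in a short sketch. 
The function names, the way folds are formed (every k-th instance), and 
the majority-class placeholder learner are assumptions for illustration; 
any pair of train and predict functions with the same shape could be 
substituted. The labels are the Influential? column of Table 5.1. 

```python
from collections import Counter


def accuracy(predicted, actual):
    """Equation 5.53: fraction of labels that are predicted correctly."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def k_fold_accuracy(instances, labels, k, train, predict):
    """Average accuracy over k rounds; round i tests on fold i only."""
    folds = [list(range(i, len(instances), k)) for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [j for j in range(len(instances)) if j not in held_out]
        model = train([instances[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        predicted = [predict(model, instances[j]) for j in test_idx]
        scores.append(accuracy(predicted, [labels[j] for j in test_idx]))
    return sum(scores) / k


# Placeholder learner: always predicts the majority label of its training set.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


instances = list(range(1, 11))  # instance IDs from Table 5.1
labels = ["No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes"]

print(k_fold_accuracy(instances, labels, k=5,
                      train=train_majority, predict=predict_majority))
```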

In the case of regression, however, it is unreasonable to assume that 
the label can be predicted precisely because the labels are real values. A 
small variation in the prediction would result in extremely low accuracy. For 
instance, if we train a model to predict the temperature of a city in a given 
day and the model predicts the temperature to be 71.1 degrees Fahrenheit 
and the actual observed temperature is 71, then the model is highly accurate; 
however, using the accuracy measure, the model is 0% accurate. In general, 
for regression, we check if the predictions are highly correlated with the 
ground truth using correlation analysis, or we can fit lines to both ground
truth and prediction results and check if these lines are close. The smaller
the distance between these lines, the more accurate the models learned from
the data.


Table 5.3. Distance Measures

Measure Name           Formula                                                 Description
Mahalanobis            $d(X, Y) = \sqrt{(X - Y)^T \Sigma^{-1} (X - Y)}$        X, Y are feature vectors and $\Sigma$ is the covariance matrix of the dataset
Manhattan (L1 norm)    $d(X, Y) = \sum_i |x_i - y_i|$                          X, Y are feature vectors
Lp-norm                $d(X, Y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$     X, Y are feature vectors
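
For the regression case discussed above, the following sketch contrasts exact-match accuracy with Pearson correlation. Only the 71 versus 71.1 degree pair comes from the text; the remaining temperatures are invented for illustration, and Pearson correlation is one possible choice of correlation analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


observed = [71.0, 65.0, 80.0, 58.0]    # ground truth (hypothetical except 71)
predicted = [71.1, 64.8, 80.3, 58.2]   # model output (hypothetical except 71.1)

# Exact-match accuracy is zero even though every prediction is close.
exact = sum(1 for o, p in zip(observed, predicted) if o == p) / len(observed)
r = pearson(observed, predicted)
print(exact, r > 0.99)
```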


5.5 Unsupervised Learning 

Unsupervised learning is the unsupervised division of instances into groups 
of similar objects. In this topic, we focus on clustering. In clustering, the 
data is often unlabeled. Thus, the label for each instance is not known to the 
clustering algorithm. This is the main difference between supervised and 
unsupervised learning. 

Any clustering algorithm requires a distance measure. Instances are 
put into different clusters based on their distance to other instances. The 
most popular distance measure for continuous features is the Euclidean 
distance: 

\[ d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \tag{5.54} \]


where $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$ are n-dimensional
feature vectors in $\mathbb{R}^n$. A list of some commonly used distance measures
is provided in Table 5.3.
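
A minimal sketch of the Euclidean distance and two of the measures listed in Table 5.3 follows. The function names are illustrative, vectors are assumed to be equal-length sequences of numbers, and the Lp-norm exponent is written as `p`. The Mahalanobis distance is left out because it needs a covariance matrix inverse.

```python
import math


def lp_distance(x, y, p):
    """Lp-norm distance: (sum_i |x_i - y_i|^p)^(1/p), for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """Manhattan (L1 norm) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance, the Lp-norm with p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))  # prints 5.0 7
```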

Once a distance measure is selected, instances are grouped using it.
Clusters are usually represented by compact and abstract notations. “Cluster
centroids” are one common example of this abstract notation. Finally,
clusters are evaluated. There is still a large debate on the issue of evaluating
clustering because of the lack of cluster labels in unsupervised learning.
Clustering validity and the definition of valid clusters are two of the
challenges in the ongoing research.


Figure 5.7. k-Means Output on a Sample Dataset. Instances are two-dimensional vectors
shown in the 2-D space. k-means is run with k = 6, and the clusters found are visualized
using different symbols.

5.5.1 Clustering Algorithms

There are many clustering algorithm types. In this section, we discuss
partitional clustering algorithms, which are the most frequently used clustering
algorithms. In Chapter 6, we discuss two other types of clustering
algorithms: spectral clustering and hierarchical clustering.


Partitional Algorithms 

Partitional clustering algorithms partition the dataset into a set of clusters.
In other words, each instance is assigned to a cluster exactly once, and no
instance remains unassigned to clusters. k-means [Jain and Dubes, 1999]
is a well-known example of a partitional algorithm. The output of the
k-means algorithm (k = 6) on a sample dataset is shown in Figure 5.7. In
this figure, the dataset has two features, and instances can be visualized
in a two-dimensional space. The instances are shown using symbols that
represent the cluster to which they belong. The pseudocode for the k-means
algorithm is provided in Algorithm 5.2.


Algorithm 5.2 k-Means Algorithm

Require: A Dataset of Real-Value Attributes, k (number of Clusters)
1: return A Clustering of Data into k Clusters
2: Consider k random instances in the data space as the initial cluster
   centroids.
3: while centroids have not converged do
4:     Assign each instance to the cluster that has the closest cluster
       centroid.
5:     If all instances have been assigned then recalculate the cluster
       centroids by averaging instances inside each cluster
6: end while


The algorithm starts with k initial centroids. In practice, these centroids
are randomly chosen instances from the dataset. These initial instances form
the initial set of k clusters. Then, we assign each instance to one of these
clusters based on its distance to the centroid of each cluster. The calculation
of distances from instances to centroids depends on the choice of distance
measure. Euclidean distance is the most widely used distance measure. After
assigning all instances to a cluster, the centroids are recomputed by taking
the average (mean) of all instances inside the clusters (hence, the name
k-means). This procedure is repeated using the newly computed centroids
until convergence. The most common
criterion to determine convergence is to check whether centroids are no
longer changing. This is equivalent to clustering assignments of the data
instances stabilizing. In practice, the algorithm execution can be stopped
when the Euclidean distance between the centroids in two consecutive
steps is bounded above by some small positive $\epsilon$. As an alternative, k-means
implementations try to minimize an objective function. A well-known
objective function in these implementations is the squared distance error:

\[ \sum_{i=1}^{k} \sum_{j=1}^{n(i)} \| x_j^i - c_i \|^2, \tag{5.55} \]

where $x_j^i$ is the jth instance of cluster i, n(i) is the number of instances
in cluster i, and $c_i$ is the centroid of cluster i. The process stops when
the difference between the objective function values of two consecutive
iterations of the k-means algorithm is bounded by some small value $\epsilon$.
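
The procedure can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the names `kmeans` and `sse`, the `eps`, `max_iter`, and `seed` parameters, and the choice to keep the old centroid when a cluster becomes empty are all assumptions made here. Instances are assumed to be tuples of numbers.

```python
import math
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Component-wise average of a non-empty list of vectors."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(data, k, eps=1e-9, max_iter=100, seed=0):
    """Return (centroids, clusters) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # k random instances as initial centroids
    clusters = []
    for _ in range(max_iter):
        # Assign each instance to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: squared_distance(x, centroids[i]))
            clusters[nearest].append(x)
        # Recompute centroids; an empty cluster keeps its old centroid (assumption).
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        shift = max(math.sqrt(squared_distance(a, b)) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= eps:  # centroids no longer moving
            break
    return centroids, clusters


def sse(centroids, clusters):
    """Squared distance error of Equation 5.55."""
    return sum(squared_distance(x, c) for c, cl in zip(centroids, clusters) for x in cl)


data = [(-10.0,), (-5.0,), (5.0,), (10.0,)]
centroids, clusters = kmeans(data, k=2)
print(sorted(centroids), sse(centroids, clusters))
```

Stopping on the objective function instead would mean comparing `sse` values of consecutive iterations against the same small threshold.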

Note that k-means is highly sensitive to the initial k centroids, and
different clustering results can be obtained on a single dataset depending
on the initial k centroids. This problem can be mitigated by running k-means
multiple times and selecting the clustering assignment that is observed most
often or is more desirable based on an objective function, such as the squared
error. Since k-means assumes that instances that belong to the same cluster
are the ones that found the cluster’s centroid closer than other centroids in
the dataset, one can safely assume that all the instances inside a cluster fall
into a hyper-sphere, with the centroid being its center. The radius for this
hyper-sphere is defined based on the farthest instance inside this cluster.
When the clusters that need to be extracted are not spherical (globular),
k-means has problems detecting them. This problem can be addressed by a
preprocessing step in which a transformation is performed on the dataset
to solve this issue.

5.5.2 Unsupervised Learning Evaluation 

When clusters are found, there is a need to evaluate how accurately the 
task has been performed. When ground truth is available, we have prior 
knowledge of which instances should belong to which cluster, as discussed 
in Chapter 6 in detail. However, evaluating clustering is a challenge because 
ground truth is often not available. When ground truth is unavailable, 
we incorporate techniques that analyze the clusters found and describe the 
quality of clusters found. In particular, we can use techniques that measure 
cohesiveness or separateness of clusters. 

Cohesiveness 

In evaluating clustering, we are interested in clusters that exhibit
cohesiveness. In cohesive clusters, instances inside the clusters are close to each
other. In statistical terms, this is equivalent to having a small standard
deviation (i.e., being close to the mean value). In clustering, this translates to
being close to the centroid of the cluster. So cohesiveness is defined as the
distance from instances to the centroid of their respective clusters,

\[ \text{cohesiveness} = \sum_{i=1}^{k} \sum_{j=1}^{n(i)} \| x_j^i - c_i \|^2, \tag{5.56} \]

which is the squared distance error (also known as SSE) discussed
previously. Small values of cohesiveness denote highly cohesive clusters in
which all instances are close to the centroid of the cluster.

Example 5.7. Figure 5.8 shows a dataset of four one-dimensional
instances. The instances are clustered into two clusters. Instances in cluster
1 are $x_1^1$ and $x_2^1$, and instances in cluster 2 are $x_1^2$ and $x_2^2$. The centroids
of these two clusters are denoted as $c_1$ and $c_2$. For these two clusters, the
cohesiveness is

\[ \text{cohesiveness} = |-10 - (-7.5)|^2 + |-5 - (-7.5)|^2 + |5 - 7.5|^2 + |10 - 7.5|^2 = 25. \tag{5.57} \]


Figure 5.8. Unsupervised Learning Evaluation.


Separateness 

We are also interested in clustering of the data that generates clusters that 
are well separated from one another. To measure this distance between 
clusters, we can use the separateness measure. In statistics, separateness 
can be measured by standard deviation. Standard deviation is maximized 
when instances are far from the mean. In clustering terms, this is equivalent 
to cluster centroids being far from the mean of the entire dataset: 

\[ \text{separateness} = \sum_{i=1}^{k} \| c - c_i \|^2, \tag{5.58} \]

where $c = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the centroid of all instances and $c_i$ is the centroid
of cluster i. Large values of separateness denote clusters that are far apart.

Example 5.8. For the dataset shown in Figure 5.8, the centroid for all
instances is denoted as c. For this dataset, the separateness is

\[ \text{separateness} = |-7.5 - 0|^2 + |7.5 - 0|^2 = 112.5. \tag{5.59} \]
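
The two measures can be checked numerically with a short sketch. The function names are illustrative, and the instance values (-10, -5, 5, 10) are read off Equation 5.57 rather than from the figure itself.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def cohesiveness(clusters):
    """Sum of squared distances from each instance to its cluster centroid."""
    return sum(
        squared_distance(x, centroid(cluster)) for cluster in clusters for x in cluster
    )


def separateness(clusters):
    """Sum of squared distances from each cluster centroid to the global centroid."""
    overall = centroid([x for cluster in clusters for x in cluster])
    return sum(squared_distance(overall, centroid(cluster)) for cluster in clusters)


clusters = [[(-10.0,), (-5.0,)], [(5.0,), (10.0,)]]
print(cohesiveness(clusters), separateness(clusters))  # prints 25.0 112.5
```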

In general, we are interested in clusters that are both cohesive and
separate. The silhouette index combines both these measures.


Silhouette Index 

The silhouette index combines both cohesiveness and separateness. It
compares the average distance value between instances in the same cluster and
the average distance value between instances in different clusters. In a
well-clustered dataset, the average distance between instances in the same cluster
is small (cohesiveness), and the average distance between instances in
different clusters is large (separateness). Let a(x) denote the average distance
between instance x of cluster C and all other members of C:

\[ a(x) = \frac{1}{|C| - 1} \sum_{y \in C, y \neq x} \| x - y \|^2. \tag{5.60} \]


Let $G \neq C$ denote the cluster that is closest to x in terms of the average
distance between x and members of G. Let b(x) denote the average distance
between instance x and instances in cluster G:

\[ b(x) = \min_{G \neq C} \frac{1}{|G|} \sum_{y \in G} \| x - y \|^2. \tag{5.61} \]


Since we want distance between instances in the same cluster to be
smaller than distance between instances in different clusters, we are
interested in a(x) < b(x). The silhouette clustering index is formulated as

\[ s(x) = \frac{b(x) - a(x)}{\max(b(x), a(x))}, \tag{5.62} \]

\[ \text{silhouette} = \frac{1}{n} \sum_{x} s(x). \tag{5.63} \]


The silhouette index takes values in [-1, 1]. The best clustering
happens when $\forall x,\ a(x) \ll b(x)$. In this case, silhouette $\approx$ 1. Similarly, when
silhouette < 0, that indicates that many instances are closer to other clusters
than their assigned cluster, which shows low-quality clustering.

Example 5.9. In Figure 5.8, the a(.), b(.), and s(.) values are

\[ a(x_1^1) = |-10 - (-5)|^2 = 25 \tag{5.64} \]

\[ b(x_1^1) = \frac{1}{2}\left(|-10 - 5|^2 + |-10 - 10|^2\right) = 312.5 \tag{5.65} \]

\[ s(x_1^1) = \frac{312.5 - 25}{312.5} = 0.92 \tag{5.66} \]

\[ a(x_2^1) = |-5 - (-10)|^2 = 25 \tag{5.67} \]

\[ b(x_2^1) = \frac{1}{2}\left(|-5 - 5|^2 + |-5 - 10|^2\right) = 162.5 \tag{5.68} \]

\[ s(x_2^1) = \frac{162.5 - 25}{162.5} \approx 0.846 \tag{5.69} \]

\[ a(x_1^2) = |5 - 10|^2 = 25 \tag{5.70} \]

\[ b(x_1^2) = \frac{1}{2}\left(|5 - (-10)|^2 + |5 - (-5)|^2\right) = 162.5 \tag{5.71} \]

\[ s(x_1^2) = \frac{162.5 - 25}{162.5} \approx 0.846 \tag{5.72} \]

\[ a(x_2^2) = |10 - 5|^2 = 25 \tag{5.73} \]

\[ b(x_2^2) = \frac{1}{2}\left(|10 - (-5)|^2 + |10 - (-10)|^2\right) = 312.5 \tag{5.74} \]

\[ s(x_2^2) = \frac{312.5 - 25}{312.5} = 0.92. \tag{5.75} \]

Given the s(.) values, the silhouette index is

\[ \text{silhouette} = \frac{1}{4}(0.92 + 0.846 + 0.846 + 0.92) \approx 0.883. \tag{5.76} \]
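
The silhouette computation of Equations 5.60 to 5.63 can be sketched as below, using squared distances as in the text. The function name, the treatment of singleton clusters (scored 0), and the guard against a zero denominator are assumptions of this sketch; the instance values are those used in Example 5.9.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def silhouette(clusters):
    """Average s(x) over all instances; clusters is a list of lists of vectors."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for xi, x in enumerate(cluster):
            same = [y for yi, y in enumerate(cluster) if yi != xi]
            if not same:
                # Assumption: a(x) is undefined for a singleton cluster, score it 0.
                scores.append(0.0)
                continue
            # a(x): average distance to the other members of x's own cluster.
            a = sum(squared_distance(x, y) for y in same) / len(same)
            # b(x): smallest average distance to the members of any other cluster.
            b = min(
                sum(squared_distance(x, y) for y in other) / len(other)
                for gi, other in enumerate(clusters)
                if gi != ci and other
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


clusters = [[(-10.0,), (-5.0,)], [(5.0,), (10.0,)]]
print(round(silhouette(clusters), 3))  # prints 0.883
```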

5.6 Summary 

This chapter covered data mining essentials. The general process for
analyzing data is known as knowledge discovery in databases (KDD). The first
step in the KDD process is data representation. Data instances are
represented in tabular format using features. These instances can be labeled or
unlabeled. There exist different feature types: nominal, ordinal, interval,
and ratio. Data representation for text data can be performed using the
vector space model. After resolving representation, quality measures need to
be addressed and preprocessing steps completed before processing the data.
Quality measures include noise removal, outlier detection, missing values
handling, and duplicate data removal. Preprocessing techniques commonly
performed are aggregation, discretization, feature selection, feature
extraction, and sampling.

We covered two categories of data mining algorithms: supervised and
unsupervised learning. Supervised learning deals with mapping feature
values to class labels, and unsupervised learning is the unsupervised division
of instances into groups of similar objects.

When labels are discrete, the supervised learning is called classification,
and when labels are real numbers, it is called regression. We covered these
classification methods: decision tree learning, naive Bayes classifier (NBC),
nearest neighbor classifier, and classifiers that use network information. We
also discussed linear and logistic regression.

To evaluate supervised learning, a training-testing framework is used
in which the labeled dataset is partitioned into two parts, one for training
and the other for testing. Different approaches for evaluating supervised
learning such as leave-one-out or k-fold cross validation were discussed.

Any clustering algorithm requires the selection of a distance measure.
We discussed partitional clustering algorithms and k-means from these
algorithms, as well as methods of evaluating clustering algorithms. To
evaluate clustering algorithms, one can use clustering quality measures such
as cohesiveness, which measures how close instances are inside clusters, or
separateness, which measures how separate different clusters are from one
another. The silhouette index combines cohesiveness and separateness into
one measure.


5.7 Bibliographic Notes 

A general review of data mining algorithms can be found in the machine
learning and pattern recognition [Bishop, 2006; Duda, Hart, and Stork,
2012; Mitchell, 1997; Quinlan, 1986, 1993; Langley, 1995], data mining
[Friedman et al., 2009; Han et al., 2006; Witten et al., 2011; Tan et al.,
2005], and pattern recognition [Bishop, 1995; Richard
et al., 2001] literature.

Among preprocessing techniques, feature selection and feature
extraction have gained much attention due to their importance. General references
for feature selection and extraction can be found in [Liu and Motoda, 1998;
Dash and Liu, 1997, 2000; Guyon, 2006; Zhao and Liu, 2011; Liu and Yu,
2005]. Feature selection has also been discussed
in social media data in [Tang and Liu, 2012a,b, 2013]. Although not much
research is dedicated to sampling in social media, it plays an important
role in the experimental outcomes of social media research. Most
experiments are performed using sampled social media data, and it is important
for these samples to be representative samples of the site that is under
study. For instance, Morstatter et al. [2013] studied whether Twitter’s
heavily sampled Streaming API, a free service for social media data, accurately
portrays the true activity on Twitter. They show that the bias introduced by
the Streaming API is significant.

In addition to the data mining categories covered in this chapter, there
are other important categories in the area of data mining and machine
learning. In particular, an interesting category is semi-supervised
learning. In semi-supervised learning, the label is available for some instances,
but not all. The model uses the labeled information and the feature
distribution of the unlabeled data to learn a model. Expectation maximization
(EM) is a well-established technique from this area. In short, EM learns
a model from the data that is partially labeled (expectation step). Then, it
uses this model to predict labels for the unlabeled instances (maximization
step). The predicted labels for instances are used once again to refine the
learned model and revise predictions for unlabeled instances in an iterative
fashion until convergence is reached. In addition to supervised methods
covered, neural networks deserve mention [Haykin and Network, 2004].
More on regression techniques is available in [Neter et al., 1996; Bishop,
2006].

Clustering is one of the most popular areas in the field of machine
learning research. A taxonomy of clustering algorithms can be found in
[Berkhin, 2006; Jain et al., 1999; Xu and Wunsch, 2005; Mirkin, 2005].
Among clustering algorithms, some of which use data density of cluster
data, DBSCAN [Ester et al., 1996], GDBSCAN [Sander et al., 1998],
CLARANS [Ng and Han, 1994], and OPTICS [Ankerst et al., 1999] are
some of the most well known and practiced algorithms. Most of the previous
contributions in the area of clustering consider the number of clusters as an
input parameter. Early literature in clustering attempted to solve this
by running algorithms for several values of K (number of clusters) and selecting
the best K that optimizes some coefficient [Milligan and Cooper, 1985;
Berkhin, 2006]. For example, the distance between two cluster centroids
normalized by a cluster’s standard deviation could be used as a coefficient.
After the coefficient is selected, the coefficient values are plotted as a
function of K (number of clusters) and the best K is selected.

An interesting application of data mining is sentiment analysis in which
the level of subjective content in information is quantified; for
example, identifying the polarity (i.e., being positive/negative) of a digital
camera review. General references for sentiment analysis can be found
in [Pang and Lee, 2008; Liu, 2007], and examples of recent developments
in social media are available in [Hu et al., 2013a,b].

5.8 Exercises 

1. Describe how methods from this chapter can be applied in social 
media. 

2. Outline a framework for using the supervised learning algorithm for
unsupervised learning.


Data


3. Describe methods that can be used to deal with missing data. 

4. Given a continuous attribute, how can we convert it to a discrete 
attribute? How can we convert discrete attributes to continuous ones? 

5. If you had the chance of choosing either instance selection or feature 
selection, which one would you choose? Please justify. 

6. Given two text documents that are vectorized, how can we measure 
document similarity? 

7. In the example provided for TF-IDF (Example 5.1), the word “orange” 
received zero score. Is this desirable? What does a high TF-IDF value 
show? 


Supervised Learning 


8. Provide a pseudocode for decision tree induction. 

9. How many decision trees containing n attributes and a binary class can
be generated?

10. What does zero entropy mean? 

11. • What is the time complexity for learning a naive Bayes classifier?

• What is the time complexity for classifying using the naive Bayes
classifier?

• Linear separability: Two sets of two-dimensional instances are
linearly separable if they can be completely separated using one
line. In n-dimensional space, two sets of instances are linearly
separable if one can separate them by a hyper-plane. A classical
example of nonlinearity is the XOR function. In this function, the
two instance sets are the black-and-white instances (see Figure 5.9),
which cannot be separated using a single line. This is an example
of a nonlinear binary function. Can a naive Bayes classifier learn
nonlinear binary functions? Provide details.

• What about linear separability and K-NN? Are K-NNs capable of
solving such problems?


Figure 5.9. Nonlinearity of XOR Function.

12. Describe how the least square solution can be determined for
regression.

Unsupervised Learning 

13. (a) Given k clusters and their respective cluster sizes $s_1, s_2, \ldots, s_k$,
what is the probability that two random (with replacement) data
vectors (from the clustered dataset) belong to the same cluster?

(b) Now, assume you are given this probability (you do not have the $s_i$’s
and k), and the fact that clusters are equally sized, can you find k?
This gives you an idea how to predict the number of clusters in a
dataset.

14. Give an example of a dataset consisting of four data vectors where there
exist two different optimal (minimum SSE) 2-means (k-means, k = 2)
clusterings of the dataset.

• Calculate the optimal SSE value for your example.

• In general, what should datasets look like geometrically so that we
have more than one optimal solution?

• What defines the number of optimal solutions?

15. Perform two iterations of the k-means algorithm in order to obtain two
clusters for the input instances given in Table 5.4. Assume that the first
centroids are instances 1 and 3. Explain if more iterations are needed
to get the final clusters.


Table 5.4. Dataset

Instance    X       Y
1           12.0    15.0
2           12.0    33.0
3           18.0    15.0
4           18.0    27.0
5           24.0    21.0
6           36.0    42.0


16. What is the usual shape of clusters generated by k-means? Give an
example of cases where k-means has limitations in detecting the
patterns formed by the instances.

17. Describe a preprocessing strategy that can help detect nonspherical
clusters using k-means.


In November 2010, a team of Dutch law enforcement agents dismantled
a community of 30 million infected computers across the globe that were
sending more than 3.6 billion daily spam mails. These distributed networks
of infected computers are called botnets. The community of computers in
a botnet transmits spam or viruses across the web without their owners’
permission. The members of a botnet are rarely known; however, it is vital
to identify these botnet communities and analyze their behavior to enhance
internet security. This is an example of community analysis. In this chapter,
we discuss community analysis in social media.

Also known as groups, clusters, or cohesive subgroups, communities
have been studied extensively in many fields and, in particular, the social
sciences. In social media mining, analyzing communities is essential. Studying
communities in social media is important for many reasons. First,
individuals often form groups based on their interests, and when studying
individuals, we are interested in identifying these groups. Consider the importance
of finding groups with similar reading tastes by an online book seller for
recommendation purposes. Second, groups provide a clear global view of
user interactions, whereas a local view of individual behavior is often noisy
and ad hoc. Finally, some behaviors are only observable in a group setting
and not on an individual level. This is because the individual’s behavior can
fluctuate, but group collective behavior is more robust to change. Consider
the interactions between two opposing political groups on social media.
Two individuals, one from each group, can hold similar opinions on a
subject, but what is important is that their communities can exhibit opposing
views on the same subject.

In this chapter, we discuss communities and answer the following three
questions in detail:

1. How can we detect communities? This question is discussed in
different disciplines, but in diverse forms. In particular, quantization in
electrical engineering, discretization in statistics, and clustering in
machine learning tackle a similar challenge. As discussed in Chapter
5, in clustering, data points are grouped together based on a
similarity measure. In community detection, data points represent actors
in social media, and similarity between these actors is often defined
based on the interests these users share. The major difference between
clustering and community detection is that in community detection,
individuals are connected to others via a network of links, whereas
in clustering, data points are not embedded in a network.

2. How do communities evolve and how can we study evolving
communities? Social media forms a dynamic and evolving environment.
Similar to real-world friendships, social media interactions evolve
over time. People join or leave groups; groups expand, shrink,
dissolve, or split over time. Studying the temporal behavior of
communities is necessary for a deep understanding of communities in social
media.

3. How can we evaluate detected communities? As emphasized in our
botnet example, the list of community members (i.e., ground truth)
is rarely known. Hence, community evaluation is a challenging task
and often means evaluating detected communities in the absence
of ground truth.


Social Communities 

Broadly speaking, a real-world community is a body of individuals with
common economic, social, or political interests/characteristics, often living
in relative proximity. A virtual community comes into existence when
like-minded users on social media form a link and start interacting with each
other. In other words, formation of any community requires (1) a set of at
least two nodes sharing some interest and (2) interactions with respect to
that interest.

As a real-world community example, consider the interactions of a
college karate club collected by Wayne Zachary in 1977. The example is often
referred to as Zachary’s Karate Club [Zachary, 1977] in the literature.
Figure 6.1 depicts the interactions in a college karate club over two years. The
links show friendships between members. During the observation period,
individuals split into two communities due to a disagreement between the
club administrator and the karate instructor, and members of one
community left to start their own club. In this figure, node colors demonstrate the
communities to which individuals belong. As observed in this figure, using
graphs is a convenient way to depict communities because color-coded
nodes can denote memberships and edges can be used to denote relations.
Furthermore, we can observe that individuals are more likely to be friends
with members of their own group, hence creating tightly knit components
in the graph.


Figure 6.1. Zachary’s Karate Club. Nodes represent karate club members and edges
represent friendships. A conflict in the club divided the members into two groups. The
color of the nodes denotes which one of the two groups the nodes belong to.

Zachary’s Karate Club is an example of two explicit communities. An
explicit community, also known as an emic community, satisfies the
following three criteria:

1. Community members understand that they are its members.

2. Nonmembers understand who the community members are.

3. Community members often have more interactions with each other
than with nonmembers.

In contrast to explicit communities, in implicit communities, also known
as etic communities, individuals tacitly interact with others in the form of an
unacknowledged community. For instance, individuals calling Canada from
the United States on a daily basis need not be friends and do not consider
each other as members of the same explicit community. However, from the
phone operator’s point of view, they form an implicit community that needs
to be marketed the same promotions. Finding implicit communities is of
major interest, and this chapter focuses on finding these communities in
social media.

Communities in social media are more or less representatives of
communities in the real world. As mentioned, in the real world, members of
communities are often geographically close to each other. The
geographical location becomes less important in social media, and many communities
on social media consist of highly diverse people from all around the planet.
In general, people in real-world communities tend to be more similar than
those of social media. People do not need to share language, location, and
the like to be members of social media communities. Similar to real-world
communities, communities in social media can be labeled as explicit or
implicit. Examples of explicit communities in well-known social media
sites include the following:

• Facebook. In Facebook, there exist a variety of explicit communities,
such as groups and communities. In these communities, users can post
messages and images, comment on other messages, like posts, and
view activities of others.

• Yahoo! Groups. In Yahoo! Groups, individuals join a group mailing
list where they can receive emails from all or a selection of group
members (administrators) directly.

• LinkedIn. LinkedIn provides its users with a feature called Groups
and Associations. Users can join professional groups where they can
post and share information related to the group.

Because these sites represent explicit communities, individuals have an
understanding of when they are joining them. However, there exist implicit
communities in social media as well. For instance, consider individuals
with the same taste for certain movies on a movie rental site. These
individuals are rarely all members of the same explicit community. However,
the movie rental site is particularly interested in finding these implicit
communities so it can better market to them by recommending movies similar
to their tastes. We discuss techniques to find these implicit communities
next.


6.1 Community Detection 

As mentioned earlier, communities can be explicit (e.g., Yahoo! Groups),
or implicit (e.g., individuals who write blogs on the same or similar topics).
In contrast to explicit communities, in many social media sites, implicit
communities and their members are obscure to many people. Community
detection finds these implicit communities.

In the simplest form, similar to the graph shown in Figure 6.1,
community detection algorithms are often provided with a graph where nodes
represent individuals and edges represent friendships between individuals.
This definition can be generalized. Edges can also be used to represent
contents or attributes shared by individuals. For instance, we can connect
individuals at the same location, with the same gender, or who bought the
same product using edges. Similarly, nodes can also represent products,
sites, and webpages, among others. Formally, for a graph $G(V, E)$, the task
of community detection is to find a set of communities $\{C_i\}_{i=1}^{n}$ in G such
that $\bigcup_{i=1}^{n} C_i \subseteq V$.


6.1.1 Community Detection Algorithms 

There are a variety of community detection algorithms. When detecting
communities, we are interested in detecting communities with either (1)
specific members or (2) specific forms of communities. We denote the
former as member-based community detection and the latter as group-based
community detection. Consider the network of 10 individuals shown in
Figure 6.2 where 7 are wearing black t-shirts and 3 are wearing white ones. If we
group individuals based on their t-shirt color, we end up having a community
of three and a community of seven. This is an example of member-based
community detection, where we are interested in specific members
characterized by their t-shirts’ color. If we group the same set based on the density
of interactions (i.e., internal edges), we get two other communities. This is
an instance of group-based community detection, where we are interested
in specific communities characterized by their interactions’ density.
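
The t-shirt example amounts to grouping nodes by a shared attribute value. A minimal sketch follows; the node identifiers 1 to 10 and the function name are invented for illustration and do not correspond to labels in Figure 6.2.

```python
from collections import defaultdict


def group_by_attribute(attributes):
    """Map each attribute value to the set of nodes that carry it."""
    communities = defaultdict(set)
    for node, value in attributes.items():
        communities[value].add(node)
    return dict(communities)


# Hypothetical labeling: nodes 1-7 wear black t-shirts, nodes 8-10 wear white.
shirt_color = {node: "black" for node in range(1, 8)}
shirt_color.update({node: "white" for node in range(8, 11)})

communities = group_by_attribute(shirt_color)
print({color: len(members) for color, members in communities.items()})
```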

Member-based community detection uses community detection
algorithms that group members based on attributes or measures such as
similarity, degree, or reachability. In group-based community detection, we are
interested in finding communities that are modular, balanced, dense, robust,
or hierarchical.


6.1.2 Member-Based Community Detection 

The intuition behind member-based community detection is that members 
with the same (or similar) characteristics are more often in the same
community. Therefore, a community detection algorithm following this approach
should assign members with similar characteristics to the same community. 

Figure 6.2. Community Detection Algorithms Example. Member-based community 
detection groups members based on their characteristics. Here, we divide the network 
based on color. In group-based community detection, we find communities based on 
group properties. Here, groups are formed based on the density of interactions among 
their members. 


Let us consider a simple example. We can assume that nodes that belong to a 
cycle form a community. This is because they share the same characteristic: 
being in the cycle. Figure 6.3 depicts a 4-cycle. For instance, we can search 
for all n-cycles in the graph and assume that they represent a community. 
The choice for n can be based on empirical evidence or heuristics, or n can 
be in a range $[\alpha_1, \alpha_2]$ for which all cycles are found. A well-known example
is the search for 3-cycles (triads) in graphs. 



Figure 6.3. A 4-Cycle.

Figure 6.4. First Four Complete Graphs.


In theory, any subgraph can be searched for and assumed to be a
community. In practice, only subgraphs that have nodes with specific characteristics
are considered as communities. Three general node characteristics that are
frequently used are node similarity, node degree (familiarity), and node
reachability.

When employing node degrees, we seek subgraphs, which are often
connected, such that each node (or a subset of nodes) has a certain node degree
(number of incoming or outgoing edges). Our 4-cycle example has
this property, the degree of each node being two. In reachability, we seek
subgraphs with specific properties related to paths existing between nodes.
For instance, our 4-cycle example also satisfies a reachability
characteristic: every pair of nodes can be reached via two independent paths. In
node similarity, we assume nodes that are highly similar belong to the same
community.


Node Degree 

The most common subgraph searched for in networks based on node degrees
is a clique. A clique is a complete subgraph, that is, a subgraph in which all
pairs of nodes are connected. In terms of the node degree
characteristic, a clique of size k is a subgraph of k nodes where all node
degrees in the induced subgraph are k − 1. The only difference between
cliques and complete graphs is that cliques are subgraphs, whereas complete
graphs contain the whole node set V. The simplest four complete graphs
(or cliques, when these are subgraphs) are represented in Figure 6.4.

To find communities, we can search for the maximum clique (the one with 
the largest number of vertices) or for all maximal cliques (cliques that are 
not subgraphs of a larger clique; i.e., cannot be expanded further). However, 
both problems are NP-hard, as is verifying whether a graph contains a clique 
larger than size k. To overcome these theoretical barriers, for sufficiently 
small networks or subgraphs, we can (1) use brute force, (2) add some 
constraints such that the problem is relaxed and polynomially solvable, or 
(3) use cliques as the seed or core of a larger community. 
Algorithm 6.1 Brute-Force Clique Identification

Require: Adjacency Matrix A, Vertex v_x
 1: return Maximal Clique C containing v_x
 2: CliqueQueue = {{v_x}};
 3: while CliqueQueue has not changed do
 4:     C = pop(CliqueQueue);
 5:     v_last = Last node added to C;
 6:     N(v_last) = {v_i | A_{v_last, v_i} = 1};
 7:     for all v_temp ∈ N(v_last) do
 8:         if C ∪ {v_temp} is a clique then
 9:             push(CliqueQueue, C ∪ {v_temp});
10:         end if
11:     end for
12: end while
13: Return the largest clique from CliqueQueue


Brute-force clique identification. The brute-force method can find all
maximal cliques in a graph. For each vertex $v_x$, we try to find the
maximal clique that contains node $v_x$. The brute-force algorithm is detailed in
Algorithm 6.1.

The algorithm starts with an empty queue of cliques. This queue is
initialized with the node $v_x$ that is being analyzed (a clique of size 1). Then,
a clique C is popped from the queue. The last node added to clique C
is selected ($v_{last}$). Each neighbor of $v_{last}$ is added to the popped clique
C in turn, and if the new set of nodes creates a larger clique (i.e., the
newly added node is connected to all of the other members), then the new
clique is pushed back into the queue. This procedure is followed until nodes
can no longer be added.
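The procedure can be sketched in a few lines of Python. This is a minimal sketch rather than a transcription of Algorithm 6.1: the function name, the adjacency-set dictionary `adj`, and the toy graph are assumptions made for illustration, and the sketch keeps track of the largest clique seen so far (the `best` variable) instead of reading it from the queue at the end.

```python
from collections import deque


def largest_clique_containing(adj, v_x):
    """Grow cliques from {v_x} one neighbor at a time (exponential time)."""
    queue = deque([[v_x]])   # each clique is a list; its last item was added last
    best = [v_x]             # largest clique seen so far (our addition)
    while queue:
        clique = queue.popleft()
        if len(clique) > len(best):
            best = clique
        v_last = clique[-1]
        for v_temp in adj[v_last]:
            # v_temp extends the clique only if it is adjacent to every member.
            if v_temp not in clique and all(v_temp in adj[u] for u in clique):
                queue.append(clique + [v_temp])
    return set(best)


# Hypothetical toy graph: a triangle {1, 2, 3} that shares node 3
# with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

clique_from_1 = largest_clique_containing(adj, 1)
clique_from_4 = largest_clique_containing(adj, 4)
```

Because the same clique is reached through many different insertion orders, the queue grows very quickly, which is the behavior the text warns about.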

The brute-force algorithm becomes impractical for large networks. For
instance, for a complete graph of only 100 nodes, the algorithm will generate
at least $2^{99}$ different cliques starting from any node in the graph (why?).

1-plex: $\{v_2, v_3, v_4, v_5\}$

2-plex: $\{v_1, v_2, v_3, v_4, v_5\}$, $\{v_2, v_3, v_4, v_5, v_6\}$

3-plex: $\{v_1, v_2, v_3, v_4, v_5, v_6\}$

Figure 6.5. Maximal k-plexes for k = 1, 2, and 3.

The performance of the brute-force algorithm can be enhanced by
pruning specific nodes and edges. If the cliques being searched for are of size
k or larger, we can simply assume that the clique, if found, should contain
nodes that have degrees equal to or more than k − 1. We can first prune all
nodes (and edges connected to them) with degrees less than k − 1. Due to
the power-law distribution of node degrees, many nodes exist with small
degrees (1, 2, etc.). Hence, for a large enough k many nodes and edges
will be pruned, which will reduce the computation drastically. This pruning
works for both directed and undirected graphs.
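The pruning step can be sketched as follows for an undirected graph. The function name and the toy graph are illustrative assumptions. The sketch also repeats the pruning until no node changes, which goes one step beyond the single pass described above: removing a node lowers the degrees of its neighbors, so further nodes may become prunable.

```python
def prune_low_degree(adj, k):
    """Drop nodes that cannot belong to a clique of size k (degree < k - 1)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        too_small = [v for v, nbrs in adj.items() if len(nbrs) < k - 1]
        for v in too_small:
            for u in adj.pop(v):
                if u in adj:              # u may already have been pruned
                    adj[u].discard(v)
            changed = True
    return adj


# Hypothetical toy graph: a triangle {1, 2, 3} that shares node 3
# with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

pruned_for_4 = prune_low_degree(adj, 4)   # searching for cliques of size >= 4
pruned_for_3 = prune_low_degree(adj, 3)   # searching for cliques of size >= 3
```

For k = 4 only the nodes of the 4-clique survive, so the brute-force search afterward runs on a much smaller graph.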

Even with pruning, cliques have intrinsic properties that make
them a less desirable means for finding communities. Cliques are rarely
observed in the real world. For instance, consider a clique of 1,000 nodes.
This subgraph has $\frac{999 \times 1000}{2} = 499{,}500$ edges. Removing a single
one of these edges results in a subgraph that is no longer a clique. That
single edge represents roughly 0.0002% of the edges, which makes finding cliques an
unlikely and challenging task.

In practice, to overcome this challenge, we can either relax the clique 
structure or use cliques as a seed or core of a community. 

Relaxing cliques. A well-known clique relaxation that comes from
sociology is the k-plex concept. In a clique of size k, all nodes have the degree
of k − 1; however, in a k-plex, all nodes have a minimum degree that is not
necessarily k − 1 (as opposed to cliques of size k). For a set of vertices V,
the structure is called a k-plex if we have

$$ d_v \ge |V| - k, \quad \forall v \in V, \tag{6.1} $$

where $d_v$ is the degree of v in the induced subgraph (i.e., the number of
nodes from the set V that are connected to v).

Clearly, a clique of size k is a 1-plex. As k gets larger in a k-plex, the
structure gets increasingly relaxed, because we can remove more edges
from the clique structure. Finding the maximum k-plex in a graph still
tends to be NP-hard, but in practice, finding it is relatively easy due to
smaller search space. Figure 6.5 shows maximal k-plexes for k = 1, 2, and
3. A k-plex is maximal if it is not contained in a larger k-plex (i.e., one
with more nodes).
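Equation 6.1 translates directly into a membership test. The sketch below assumes the graph is stored as a dictionary of neighbor sets; the function name and the small example graph are hypothetical.

```python
def is_k_plex(adj, nodes, k):
    """Check Equation 6.1: every node has induced degree >= |V| - k."""
    nodes = set(nodes)
    return all(len(adj[v] & nodes) >= len(nodes) - k for v in nodes)


# Hypothetical example: a 4-clique on {1, 2, 3, 4} with the edge (2, 4) removed.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

whole_is_1_plex = is_k_plex(adj, {1, 2, 3, 4}, 1)   # a 1-plex must be a clique
whole_is_2_plex = is_k_plex(adj, {1, 2, 3, 4}, 2)   # one missing neighbor allowed
triangle_is_1_plex = is_k_plex(adj, {1, 2, 3}, 1)
```

Removing one edge from the clique breaks the 1-plex condition but not the 2-plex condition, which illustrates how the structure relaxes as k grows.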

Using cliques as a seed of a community. When using cliques as a seed
or core of a community, we assume communities are formed from a set of
cliques (small or large) in addition to edges that connect these cliques. A
well-known algorithm in this area is the clique percolation method (CPM)
[Palla et al., 2005]. The algorithm is provided in Algorithm 6.2.

Algorithm 6.2 Clique Percolation Method (CPM)

Require: parameter k
1: return Overlapping Communities
2: Cliques_k = find all cliques of size k
3: Construct clique graph G(V, E), where |V| = |Cliques_k|
4: E = {e_ij | clique i and clique j share k − 1 nodes}
5: Return all connected components of G

Given parameter k, the method starts by finding all cliques of size k. Then a
graph is generated (clique graph) where all cliques are represented as nodes,
and cliques that share k − 1 vertices are connected via edges.
Communities are then found by reporting the connected components of this graph.
The algorithm searches for all cliques of size k and is therefore
computationally intensive. In practice, when using the CPM algorithm, we often
solve CPM for a small k. Relaxations discussed for cliques are desirable to
enable the algorithm to perform faster. Lastly, CPM can return overlapping
communities.
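A compact sketch of CPM follows. It is an illustration under assumptions, not an efficient implementation: cliques of size k are found by testing every k-subset of nodes, the function name and toy graph are hypothetical, and each community is reported as the union of the nodes of the cliques in one connected component of the clique graph.

```python
from itertools import combinations


def clique_percolation(adj, k):
    """Return overlapping communities as a list of node sets."""
    # Step 1: all cliques of size k (brute force over k-subsets).
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    # Step 2: clique graph; two cliques are adjacent if they share k - 1 nodes.
    neighbors = {i: set() for i in range(len(cliques))}
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            neighbors[i].add(j)
            neighbors[j].add(i)
    # Step 3: connected components of the clique graph, mapped back to nodes.
    communities, seen = [], set()
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            i = stack.pop()
            members |= cliques[i]
            for j in neighbors[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        communities.append(members)
    return communities


# Hypothetical toy graph: a triangle {1, 2, 3} that shares node 3
# with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

communities = clique_percolation(adj, 3)
```

On this toy graph the two communities overlap in node 3, because the triangle and the 4-clique share one node but no pair of their 3-cliques shares two.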

Example 6.1. Consider the network depicted in Figure 6.6(a). The
corresponding clique graph generated by the CPM algorithm for k = 3 is
provided in Figure 6.6(b). All cliques of size k = 3 have been
identified, and cliques that share k − 1 = 2 nodes are connected. Connected
components are returned as communities ($\{v_1, v_2, v_3\}$, $\{v_8, v_9, v_{10}\}$, and
$\{v_3, v_4, v_5, v_6, v_7, v_8\}$). Nodes $v_3$ and $v_8$ belong to two communities, and
these communities are overlapping.


Node Reachability 

When dealing with reachability, we are seeking subgraphs where nodes are 
reachable from other nodes via a path. The two extremes of reachability 
are achieved when nodes are assumed to be in the same community if 
(1) there is a path between them (regardless of the distance) or (2) they 
are so close as to be immediate neighbors. In the first case, any graph 
traversal algorithm such as BFS or DFS can be used to identify connected 
components (communities). However, finding connected components is 
not very useful in large social media networks. These networks tend to 
have a large-scale connected component that contains most nodes, which 
are connected to each other via short paths. Therefore, finding connected
components is less powerful for detecting communities in them. In the
second case, when nodes are immediate neighbors of all other nodes, cliques
are formed, and as discussed previously, finding cliques is considered a very
challenging process.

Figure 6.6. Clique Percolation Method (CPM) Example for k = 3: (a) graph and
(b) CPM clique graph.

To overcome these issues, we can find communities that are in between 
cliques and connected components in terms of connectivity and have small 
shortest paths between their nodes. There are predefined subgraphs, with 
roots in social sciences, with these characteristics. Well-known ones include 
the following: 

• k-Clique is a maximal subgraph where the shortest path between
any two nodes is always less than or equal to k. Note that in
k-cliques, nodes on the shortest path need not be part of
the subgraph.

• k-Club is a more restricted definition; it follows the same definition
as k-cliques with the additional constraint that nodes on the shortest
paths should be part of the subgraph.

• k-Clan is a k-clique where, for all shortest paths within the subgraph,
the distance is less than or equal to k. All k-clans are k-cliques and
k-clubs, but not vice versa. In other words,

k-clans = k-cliques ∩ k-clubs.

Figure 6.7 depicts an example of the three discussed models.

2-cliques: $\{v_1, v_2, v_3, v_4, v_5\}$, $\{v_2, v_3, v_4, v_5, v_6\}$

2-clubs: $\{v_2, v_3, v_4, v_5, v_6\}$, $\{v_1, v_2, v_3, v_4\}$, $\{v_1, v_2, v_3, v_5\}$

2-clans: $\{v_2, v_3, v_4, v_5, v_6\}$

Figure 6.7. Examples of 2-Cliques, 2-Clubs, and 2-Clans.



Node Similarity 

Node similarity attempts to determine the similarity between two nodes $v_i$
and $v_j$. Similar nodes (or most similar nodes) are assumed to be in the
same community. The question has been addressed in different fields; in
particular, the problem of structural equivalence in the field of sociology
considers the same problem. In structural equivalence, similarity is based
on the overlap between the neighborhoods of the vertices. Let $N(v_i)$ and
$N(v_j)$ be the neighbors of vertices $v_i$ and $v_j$, respectively. In this case, a
measure of vertex similarity can be defined as follows:

$$ \sigma(v_i, v_j) = |N(v_i) \cap N(v_j)|. \tag{6.2} $$

For large networks, this value can increase rapidly, because nodes may
share many neighbors. Generally, similarity is attributed to a value that is
bounded and usually in the range [0, 1]. For that to happen, various
normalization procedures such as the Jaccard similarity or the cosine similarity
can be applied:

$$ \sigma_{Jaccard}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{|N(v_i) \cup N(v_j)|}, \tag{6.3} $$

$$ \sigma_{Cosine}(v_i, v_j) = \frac{|N(v_i) \cap N(v_j)|}{\sqrt{|N(v_i)|\,|N(v_j)|}}. \tag{6.4} $$
Example 6.2. Consider the graph in Figure 6.7. The similarity values
between nodes $v_2$ and $v_5$ are

$$ \sigma_{Jaccard}(v_2, v_5) = \frac{|\{v_1, v_3, v_4\} \cap \{v_3, v_6\}|}{|\{v_1, v_3, v_4, v_6\}|} = 0.25, \tag{6.5} $$

$$ \sigma_{Cosine}(v_2, v_5) = \frac{|\{v_1, v_3, v_4\} \cap \{v_3, v_6\}|}{\sqrt{|\{v_1, v_3, v_4\}|\,|\{v_3, v_6\}|}} = \frac{1}{\sqrt{6}} \approx 0.41. \tag{6.6} $$
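The two normalized measures are easy to check numerically. The sketch below works directly on neighbor sets; the function names are our own, and the two sets are the neighborhoods of $v_2$ and $v_5$ used in Example 6.2.

```python
import math


def jaccard_similarity(n_i, n_j):
    """Equation 6.3 on two neighbor sets (0.0 if both sets are empty)."""
    union = n_i | n_j
    return len(n_i & n_j) / len(union) if union else 0.0


def cosine_similarity(n_i, n_j):
    """Equation 6.4 on two neighbor sets (0.0 if either set is empty)."""
    if not n_i or not n_j:
        return 0.0
    return len(n_i & n_j) / math.sqrt(len(n_i) * len(n_j))


# Neighborhoods of v2 and v5 as listed in Example 6.2.
N_v2 = {"v1", "v3", "v4"}
N_v5 = {"v3", "v6"}

jaccard_v2_v5 = jaccard_similarity(N_v2, N_v5)   # 1 / 4
cosine_v2_v5 = cosine_similarity(N_v2, N_v5)     # 1 / sqrt(6)
```

Returning 0.0 for empty neighborhoods is a convention chosen here to avoid division by zero; the text does not specify this case.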


In general, the definition of neighborhood $N(v_i)$ excludes the node itself
($v_i$). This, however, leads to problems with the aforementioned similarity
values because nodes that are connected and do not share a neighbor will 
be assigned zero similarity. This can be rectified by assuming that nodes 
are included in their own neighborhood. 

A generalization of structural equivalence is known as regular
equivalence. Consider the situation of two basketball players in two different
countries. Though sharing no neighborhood overlap, the social circles of 
these players (coach, players, fans, etc.) might look quite similar due to 
their social status. In other words, nodes are regularly equivalent when 
they are connected to nodes that are themselves similar (a self-referential 
definition). For more details on regular equivalence, refer to Chapter 3. 


6.1.3 Group-Based Community Detection 

When considering community characteristics for community detection, we 
are interested in communities that have certain group properties. In this 
section, we discuss communities that are balanced, robust, modular, dense, 
or hierarchical. 


Balanced Communities 

As mentioned before, community detection can be thought of as the problem 
of clustering in data mining and machine learning. Graph-based clustering 
techniques have proven to be useful in identifying communities in social 
networks. In graph-based clustering, we cut the graph into several partitions 
and assume these partitions represent communities. 

Formally, a cut in a graph is a partitioning of the graph into two
(or more) sets (cutsets). The size of the cut is the number of edges that are
being cut or, in a weighted graph, the sum of the weights of the edges that
are being cut. A minimum cut (min-cut) is a cut such that the size of the
cut is minimized. Figure 6.8 depicts several cuts in a graph. For example,
cut B has size 4, and A is the minimum cut.

Figure 6.8. Minimum Cut (A) and Two More Balanced Cuts (B and C) in a Graph.

Based on the well-known max-flow min-cut theorem, the minimum cut
of a graph can be computed efficiently. However, minimum cuts are not
always preferred for community detection. Often, they result in cuts where a
partition is only one node (singleton), and the rest of the graph is in the other.
Typically, communities with balanced sizes are preferred. Figure 6.8 depicts
an example where the minimum cut (A) creates unbalanced partitions,
whereas cut C is a more balanced cut.

To solve this problem, variants of minimum cut define an
objective function such that minimizing (or maximizing) it during the cut-finding
procedure results in a more balanced and natural partitioning of the
data. Consider a graph $G(V, E)$. A partitioning of G into k partitions
is a tuple $P = (P_1, P_2, P_3, \ldots, P_k)$, such that $P_i \subseteq V$, $P_i \cap P_j = \emptyset$, and
$\bigcup_{i=1}^{k} P_i = V$. Then, the objective functions for the ratio cut and normalized
cut are defined as follows:

$$ \text{Ratio Cut}(P) = \frac{1}{k} \sum_{i=1}^{k} \frac{cut(P_i, \bar{P}_i)}{|P_i|}, \tag{6.7} $$

$$ \text{Normalized Cut}(P) = \frac{1}{k} \sum_{i=1}^{k} \frac{cut(P_i, \bar{P}_i)}{vol(P_i)}, \tag{6.8} $$

where $\bar{P}_i = V - P_i$ is the complement cut set, $cut(P_i, \bar{P}_i)$ is the size of
the cut, and volume $vol(P_i) = \sum_{v \in P_i} d_v$. Both objective functions provide
a more balanced community size by normalizing the cut size either by the
number of vertices in the cutset or the volume (total degree).
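To see how the two objectives penalize unbalanced partitions, both can be evaluated on a small graph. The sketch below follows Equations 6.7 and 6.8 for an unweighted, undirected graph; the function name and the toy graph (two triangles joined by one edge) are assumptions, and every partition is assumed to be nonempty with nonzero volume.

```python
def cut_objectives(adj, partition):
    """Return (ratio cut, normalized cut) for a list of disjoint node sets."""
    k = len(partition)
    ratio = normalized = 0.0
    for part in partition:
        part = set(part)
        # Edges with exactly one endpoint inside this part.
        cut = sum(1 for u in part for v in adj[u] if v not in part)
        volume = sum(len(adj[u]) for u in part)   # total degree of the part
        ratio += cut / len(part)
        normalized += cut / volume
    return ratio / k, normalized / k


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

balanced = cut_objectives(adj, [{0, 1, 2}, {3, 4, 5}])
singleton = cut_objectives(adj, [{0}, {1, 2, 3, 4, 5}])
```

The balanced split scores (1/3, 1/7), whereas cutting off the single node 0 scores (1.2, 7/12), so both objectives prefer the balanced split.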

Both the ratio cut and normalized cut can be formulated in matrix
form. Let matrix $X \in \{0, 1\}^{|V| \times k}$ denote the community membership matrix,
where $X_{ij} = 1$ if node i is in community j; otherwise, $X_{ij} = 0$. Let
$D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ represent the diagonal degree matrix. Then the
ith entry on the diagonal of $X^T A X$ counts the edges that are
inside community i (each internal edge is counted twice, once from each
endpoint). Similarly, the ith element on the diagonal of $X^T D X$
is the sum of the degrees of the members of community i, that is, the
number of edge endpoints in the community. Hence, the ith element on the
diagonal of $X^T (D - A) X$ represents
the number of edges that are in the cut that separates community i from all
other nodes. In fact, the ith diagonal element of $X^T (D - A) X$ is equivalent
to the summation term $cut(P_i, \bar{P}_i)$ in both the ratio and normalized cut.
Thus, for ratio cut, we have


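$$ \text{Ratio Cut}(P) = \frac{1}{k} \sum_{i=1}^{k} \frac{cut(P_i, \bar{P}_i)}{|P_i|} \tag{6.9} $$

$$ = \frac{1}{k} \sum_{i=1}^{k} \frac{X_i^T (D - A) X_i}{X_i^T X_i} \tag{6.10} $$

$$ = \frac{1}{k} \sum_{i=1}^{k} \hat{X}_i^T (D - A) \hat{X}_i, \tag{6.11} $$

where $X_i$ is the ith column of $X$ and $\hat{X}_i = X_i / (X_i^T X_i)^{1/2}$. A similar
approach can be followed to formulate the normalized cut and to obtain a
different $\hat{X}_i$. To formulate the
summation in both the ratio and normalized cut, we can use the trace of a
matrix ($tr(\hat{X}) = \sum_{i=1}^{n} \hat{X}_{ii}$). Using the trace, the objectives for both the ratio
and normalized cut can be formulated as trace-minimization problems,

$$ \min_{\hat{X}} Tr(\hat{X}^T L \hat{X}), \tag{6.12} $$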


where L is the (normalized) graph Laplacian, defined as follows:

$$ L = \begin{cases} D - A & \text{Ratio Cut Laplacian (Unnormalized Laplacian)}; \\ I - D^{-1/2} A D^{-1/2} & \text{Normalized Cut Laplacian (Normalized Laplacian)}. \end{cases} \tag{6.13} $$


It has been shown that both ratio cut and normalized cut minimization are
NP-hard; therefore, approximation algorithms using relaxations are desired.
Spectral clustering is one such relaxation:

$$ \min_{\hat{X}} Tr(\hat{X}^T L \hat{X}), \tag{6.14} $$

$$ \text{s.t. } \hat{X}^T \hat{X} = I_k. \tag{6.15} $$

The solution to this problem is the top eigenvectors of L. Given L, the
top k eigenvectors corresponding to the smallest eigenvalues are computed
and used as $\hat{X}$, and then k-means is run on $\hat{X}$ to extract community
memberships (X). The first eigenvector is meaningless (why?); hence, the
rest of the eigenvectors (k − 1) are used as k-means input.

Example 6.3. Consider the graph in Figure 6.8. We find two communities
in this graph using spectral clustering (i.e., k = 2). Then, we have

$$ D = \mathrm{diag}(2, 2, 4, 4, 4, 4, 4, 3, 1). \tag{6.16} $$

[Equations 6.17 and 6.18: the adjacency matrix A and the unnormalized
Laplacian L = D − A; the matrices are missing from the source.]

We aim to find two communities; therefore, we get two eigenvectors
corresponding to the two smallest eigenvalues from L (one row per node):

$$
\begin{array}{c|rr}
1 & 0.33 & -0.46 \\
2 & 0.33 & -0.46 \\
3 & 0.33 & -0.26 \\
4 & 0.33 & 0.01 \\
5 & 0.33 & 0.01 \\
6 & 0.33 & 0.13 \\
7 & 0.33 & 0.13 \\
8 & 0.33 & 0.33 \\
9 & 0.33 & 0.59
\end{array}
$$

As mentioned, the first eigenvector is meaningless, because it assigns all
nodes to the same community. The second is used with k-means; based on
the vector signs, we get communities {1, 2, 3} and {4, 5, 6, 7, 8, 9}.
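The sign-based split can be reproduced on a small graph. The following is a sketch under stated assumptions: instead of calling a library eigensolver, it approximates the eigenvector of the second smallest eigenvalue of L = D − A by power iteration on the shifted matrix cI − L while projecting out the constant eigenvector; the function name, iteration count, and toy graph (two triangles joined by one edge) are our own choices, and for k = 2 the k-means step is replaced by thresholding at zero.

```python
def second_eigenvector(adj_matrix, iterations=500):
    """Approximate the eigenvector of the 2nd smallest eigenvalue of L = D - A."""
    n = len(adj_matrix)
    degree = [sum(row) for row in adj_matrix]
    shift = 2 * max(degree)               # upper bound on the eigenvalues of L
    x = [float(i) for i in range(n)]      # arbitrary non-constant start vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]       # remove the constant (first) eigenvector
        # y = (shift * I - L) x = (shift - d_i) x_i + sum_j A_ij x_j
        y = [(shift - degree[i]) * x[i]
             + sum(adj_matrix[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
A[4][2] = 0  # keep the matrix symmetric: node 4 is adjacent to 3 and 5 only

vector = second_eigenvector(A)
labels = [value >= 0 for value in vector]   # split nodes by sign
```

In practice a numerical linear algebra library would compute the eigenvectors directly, and k-means would be run on the k − 1 retained eigenvectors when more than two communities are sought.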


Robust Communities 

When seeking robust communities, our goal is to find subgraphs robust
enough such that removing some edges or nodes does not disconnect the
subgraph. A k-vertex connected graph (or k-connected) is an example of
such a graph. In this structure, k is the minimum number of nodes that must
be removed to disconnect the graph (i.e., there exist at least k independent
paths between any pair of nodes). A similar subgraph is the k-edge connected
graph, where at least k edges must be removed to disconnect the graph. An
upper-bound analysis on k-edge connectivity shows that the minimum degree for
any node in the graph should not be less than k (why?). For example, a
complete graph of size n is the unique (n − 1)-connected graph on n nodes,
and a cycle is a 2-connected graph.


Modular Communities 

Modularity is a measure that defines how likely the community structure
found is created at random. Clearly, community structures should be far
from random. Consider an undirected graph $G(V, E)$, $|E| = m$, where the
degrees are known beforehand, but edges are not. Consider two nodes $v_i$
and $v_j$, with degrees $d_i$ and $d_j$, respectively. What is the expected number
of edges between these two nodes? Consider node $v_i$. For any edge going
out of $v_i$ randomly, the probability of this edge getting connected to node
$v_j$ is $\frac{d_j}{\sum_i d_i} = \frac{d_j}{2m}$. Because the degree for $v_i$ is $d_i$, we have $d_i$
such edges; hence, the expected number of edges between $v_i$ and $v_j$ is
$\frac{d_i d_j}{2m}$. So, given a degree distribution, the expected number of edges between
any pair of vertices can be computed. Real-world communities are far from
random; therefore, the more distant they are from randomly generated
communities, the more structure they exhibit. Modularity defines this distance,
and modularity maximization tries to maximize this distance. Consider a
partitioning of the graph G into k partitions, $P = (P_1, P_2, P_3, \ldots, P_k)$. For
partition $P_x$, this distance can be defined as


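$$ \sum_{v_i, v_j \in P_x} \left( A_{ij} - \frac{d_i d_j}{2m} \right). \tag{6.20} $$

This distance can be generalized for partitioning P with k partitions,

$$ \sum_{x=1}^{k} \sum_{v_i, v_j \in P_x} \left( A_{ij} - \frac{d_i d_j}{2m} \right). \tag{6.21} $$

The summation is over all edges (m), and because all edges are counted
twice ($A_{ij} = A_{ji}$), the normalized version of this distance is defined as
modularity [Newman, 2006]:

$$ Q = \frac{1}{2m} \left( \sum_{x=1}^{k} \sum_{v_i, v_j \in P_x} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \right). \tag{6.22} $$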


We define the modularity matrix as $B = A - \mathbf{d}\mathbf{d}^T / 2m$, where $\mathbf{d} \in \mathbb{R}^{n \times 1}$
is the degree vector for all nodes. Similar to the spectral clustering matrix
formulation, modularity can be reformulated as

$$ Q = \frac{1}{2m} Tr(X^T B X), \tag{6.23} $$

where $X \in \mathbb{R}^{n \times k}$ is the indicator (partition membership) function; that is,
$X_{ij} = 1$ iff $v_i \in P_j$. This objective can be maximized such that the best
membership function is extracted with respect to modularity. The problem
is NP-hard; therefore, we relax X to $\hat{X}$ that has an orthogonal structure
($\hat{X}^T \hat{X} = I_k$). The optimal $\hat{X}$ can be computed using the top k eigenvectors
of B corresponding to the largest positive eigenvalues. Similar to spectral
clustering, to find X, we can run k-means on $\hat{X}$. Note that this requires that
B has at least k positive eigenvalues.
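Modularity itself is inexpensive to evaluate for a given partition. The sketch below computes Q directly from Equation 6.22 rather than through the matrix form; the function name and the toy graph (two triangles joined by one edge) are assumptions made for illustration.

```python
def modularity(adj, partition):
    """Equation 6.22 for an unweighted, undirected graph."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # sum of degrees = 2m
    total = 0.0
    for part in partition:
        for i in part:
            for j in part:
                a_ij = 1.0 if j in adj[i] else 0.0
                total += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return total / two_m


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

q_two_triangles = modularity(adj, [{0, 1, 2}, {3, 4, 5}])   # 5 / 14
q_single_block = modularity(adj, [{0, 1, 2, 3, 4, 5}])       # 0
```

Placing all nodes in one community always gives Q = 0, because the observed and expected edge counts both sum to 2m; the two-triangle split scores 5/14, reflecting its distance from a random arrangement.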


Dense Communities 

Often, we are interested in dense communities, which have sufficiently 
frequent interactions. These communities are of particular interest in social 
media where we would like to have enough interactions for analysis to 
make statistical sense. When we are measuring density in communities, 
the community may or may not be connected as long as it satisfies the 
properties required, assuming connectivity is not one such property. Cliques, 
clubs, and clans are examples of connected dense communities. Density- 
based community detection has been extensively discussed in the field of 
clustering (see Chapter 5, Bibliographic Notes). 

The density $\gamma$ of a graph defines how close a graph is to a clique. In
other words, the density $\gamma$ is the ratio of the number of edges $|E|$ that graph
G has over the maximum it can have, $\binom{|V|}{2}$:

$$ \gamma = \frac{|E|}{\binom{|V|}{2}}. \tag{6.24} $$

A graph $G = (V, E)$ is $\gamma$-dense if $|E| \ge \gamma \binom{|V|}{2}$. Note that a 1-dense
graph is a clique. Here, we discuss the interesting scenario of connected
dense graphs (i.e., quasi-cliques). A quasi-clique (or $\gamma$-clique) is a
connected $\gamma$-dense graph. Quasi-cliques can be searched for using approaches
previously discussed for finding cliques. We can utilize the brute-force 
clique identification algorithm (Algorithm 6.1) for finding quasi-cliques as 
well. The only part of the algorithm that needs to be changed is the part 
where the clique condition is checked (Line 8). This can be replaced with a 
quasi-clique checking condition. In general, because there is less regularity 
in quasi-cliques, searching for them becomes harder. Interested readers can 
refer to the bibliographic notes for faster algorithms. 
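A quasi-clique checking condition of the kind that could replace the clique test can be sketched as follows. The function names and the toy graph are hypothetical, and the node set is assumed to contain at least two nodes so that the maximum number of edges is nonzero.

```python
def density(adj, nodes):
    """Edges inside `nodes` divided by the maximum possible, C(|nodes|, 2)."""
    nodes = set(nodes)
    n = len(nodes)
    internal_edges = sum(len(adj[v] & nodes) for v in nodes) // 2
    return internal_edges / (n * (n - 1) / 2)


def is_connected(adj, nodes):
    """Depth-first search restricted to the induced subgraph on `nodes`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v] & nodes:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def is_quasi_clique(adj, nodes, gamma):
    """A quasi-clique is a connected subgraph that is gamma-dense."""
    return is_connected(adj, nodes) and density(adj, nodes) >= gamma


# Hypothetical toy graph: a triangle {1, 2, 3} that shares node 3
# with a 4-clique {3, 4, 5, 6}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

whole_graph_density = density(adj, adj.keys())   # 9 edges out of 15 possible
```

With this check in place of the clique test, the brute-force search accepts node sets that miss a fraction of their possible edges, at the cost of a larger search space.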


Hierarchical Communities 

All previously discussed methods have considered communities at a single
level. In reality, it is common to have hierarchies of communities, in which
each community can have sub/super communities. Hierarchical clustering
deals with this scenario and generates community hierarchies. Initially, n
nodes are considered as either 1 or n communities in hierarchical
clustering. These communities are gradually merged or split (agglomerative
or divisive hierarchical clustering algorithms), depending on the type of
algorithm, until the desired number of communities is reached. A
dendrogram is a visual demonstration of how communities are merged or
split using hierarchical clustering. The Girvan-Newman [2002] algorithm
is specifically designed for finding communities using divisive hierarchical
clustering.

Figure 6.9. An Example of the Girvan-Newman Algorithm: (a) graph and (b) its
hierarchical clustering dendrogram based on edge betweenness.

The assumption underlying this algorithm is that, if a network has a
set of communities and these communities are connected to one another
with a few edges, then all shortest paths between members of different
communities should pass through these edges. By removing these edges (at
times referred to as weak ties), we can recover (i.e., disconnect) communities
in a network. To find these edges, the Girvan-Newman algorithm uses a
measure called edge betweenness and removes edges with higher edge
betweenness. For an edge e, edge betweenness is defined as the number of
shortest paths between node pairs ($v_i$, $v_j$) such that the shortest path between
$v_i$ and $v_j$ passes through e. For instance, in Figure 6.9(a), edge betweenness
for edge e(1, 2) is 6/2 + 1 = 4, because all the shortest paths from 2 to
{4, 5, 6, 7, 8, 9} have to pass either e(1, 2) or e(2, 3), and e(1, 2) is the
shortest path between 1 and 2. Formally, the Girvan-Newman algorithm is
as follows:

1. Calculate edge betweenness for all edges in the graph.

2. Remove the edge with the highest betweenness.

3. Recalculate betweenness for all edges affected by the edge removal.

4. Repeat until all edges are removed.
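The four steps above can be sketched in Python as follows. Several choices here are assumptions: edge betweenness is computed with a breadth-first accumulation in the style of Brandes' algorithm, betweenness is recomputed for all edges after each removal (simpler than updating only the affected ones), the loop stops once a requested number of communities is reached, and the edge list is a reconstruction chosen to be consistent with the value e(1, 2) = 4 quoted above, not a copy of Figure 6.9(a).

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):         # accumulate from the farthest nodes back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2 for e, b in score.items()}   # each pair was counted twice


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def girvan_newman(adj, target):
    """Remove highest-betweenness edges until `target` communities exist."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    while True:
        comps = components(adj)
        if len(comps) >= target or not any(adj.values()):
            return comps
        scores = edge_betweenness(adj)
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)


# Assumed edge list (see the lead-in above).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

betweenness = edge_betweenness(adj)
communities = girvan_newman(adj, 3)
```

Running the loop until no edges remain, while recording each split, would yield the full dendrogram instead of a single level of it.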


Example 6.4. Consider the graph depicted in Figure 6.9(a). For this graph,
the edge-betweenness values are as follows (rows and columns are nodes 1
through 9):

$$
\begin{bmatrix}
0 & 4 & 1 & 9 & 0 & 0 & 0 & 0 & 0 \\
4 & 0 & 4 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 4 & 0 & 9 & 0 & 0 & 0 & 0 & 0 \\
9 & 0 & 9 & 0 & 10 & 10 & 0 & 0 & 0 \\
0 & 0 & 0 & 10 & 0 & 1 & 6 & 3 & 0 \\
0 & 0 & 0 & 10 & 1 & 0 & 6 & 3 & 0 \\
0 & 0 & 0 & 0 & 6 & 6 & 0 & 2 & 8 \\
0 & 0 & 0 & 0 & 3 & 3 & 2 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 8 & 0 & 0
\end{bmatrix}
\tag{6.25}
$$

Therefore, by following the algorithm, the first edge that needs to be
removed is e(4, 5) (or e(4, 6)). By removing e(4, 5), we compute the edge
betweenness once again; this time, e(4, 6) has the highest betweenness
value: 20. This is because all shortest paths between nodes {1, 2, 3, 4} and
nodes {5, 6, 7, 8, 9} must pass e(4, 6); therefore, it has betweenness
4 × 5 = 20. By following the first few steps of the algorithm, the dendrogram
shown in Figure 6.9(b) and three disconnected communities ({1, 2, 3, 4},
{5, 6, 7, 8}, {9}) can be obtained.


We discussed various community detection algorithms in this
section. Figure 6.10 summarizes the two categories of community detection
algorithms. 


6.2 Community Evolution 

Community detection algorithms discussed so far assume that networks are
static; that is, their nodes and edges are fixed and do not change over time.
In reality, with the rapid growth of social media, networks and their internal
communities change over time. Earlier community detection algorithms
have needed to be extended to deal with evolving networks. Before
analyzing evolving networks, we need to answer the question: How do networks
evolve? In this section, we discuss how networks evolve in general and then
how communities evolve over time. We also demonstrate how communities
can be found in these evolving networks.

Figure 6.10. Community Detection Algorithms.

6.2.1 How Networks Evolve 

Large social networks are highly dynamic: nodes and links appear or
disappear over time. In these evolving networks, many interesting patterns
are observed; for instance, when the distance (in terms of shortest path
distance) between two nodes increases, their probability of getting connected
decreases. We discuss three common patterns that are observed in evolving
networks: segmentation, densification, and diameter shrinkage.


Network Segmentation 

Often, in evolving networks, segmentation takes place, where the large
network is decomposed over time into three parts:

1. Giant Component: As network connections stabilize, a giant
component of nodes is formed, with a large proportion of network nodes
and edges falling into this component.

2. Stars: These are isolated parts of the network that form star
structures. A star is a tree with one internal node and n leaves.

3. Singletons: These are orphan nodes disconnected from all nodes in
the network.

Figure 6.11 depicts a segmented network and these three components.

Figure 6.11. Network Segmentation. The network is decomposed into a giant component
(dark gray), star components (medium gray), and singletons (light gray).


Graph Densification 


It is observed in evolving graphs that the density of the graph increases as
the network grows. In other words, the number of edges increases faster
than the number of nodes. This phenomenon is called densification. Let
$V(t)$ denote nodes at time t and let $E(t)$ denote edges at time t. Then

$$ |E(t)| \propto |V(t)|^{\alpha}. \tag{6.26} $$

If densification happens, then we have $1 \le \alpha \le 2$. There is linear growth
when $\alpha = 1$, and we get clique structures when $\alpha = 2$ (why?). Networks
exhibit $\alpha$ values between 1 and 2 when evolving. Figure 6.12 depicts a
log-log graph for densification for a physics citation network and a patent
citation network. During the evolution process in both networks, the number
of edges is recorded as the number of nodes grows. These recordings show
that both networks have $\alpha \approx 1.6$ (i.e., the log-log graph of $|E|$ with respect
to $|V|$ is a straight line with slope 1.6). This value also implies that when
V is given, to realistically model a social network, we should generate
$O(|V|^{1.6})$ edges.
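Given recorded node and edge counts, the exponent can be estimated as the slope of a least-squares line in log-log space. The sketch below is illustrative only: the function name is our own, and the counts are synthetic values generated from $|E| = |V|^{1.6}$, not measurements from the citation networks discussed in the text.

```python
import math


def densification_exponent(node_counts, edge_counts):
    """Least-squares slope of log|E| against log|V|."""
    xs = [math.log(v) for v in node_counts]
    ys = [math.log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic snapshots (assumption): edge counts follow |E| = |V| ** 1.6 exactly.
node_counts = [100, 1_000, 10_000, 100_000]
edge_counts = [v ** 1.6 for v in node_counts]

alpha = densification_exponent(node_counts, edge_counts)
```

On real snapshots the points scatter around the line, so the fitted slope is an estimate rather than an exact exponent.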


Figure 6.12. Graph Densification (from [Leskovec et al., 2005]): number of
edges versus number of nodes for (a) physics citations and (b) patent citations.


Diameter Shrinkage 

Another property observed in large networks is that the network diameter 
shrinks in time. This property has been observed in random graphs as well 
(see Chapter 4). Figure 6.13 depicts the diameter shrinkage for the same 
patent network discussed in Figure 6.12. 

In this section we discussed three phenomena that are observed in evolving
networks. Communities in evolving networks also evolve. They appear,
grow, shrink, split, merge, or even dissolve over time. Figure 6.14 depicts
different situations that can happen during community evolution.

Figure 6.13. Diameter Shrinkage over Time for a Patent Citation Network (from
[Leskovec et al., 2005]). The horizontal axis is time in years.

Figure 6.14. Community Evolution (reproduced from [Palla et al., 2007]). Each
panel shows a community at time t and at time t + 1.

Both networks and their internal communities evolve over time. Given 
evolution information (e.g., when edges or nodes are added), how can 
we study evolving communities? And can we adapt static (nontemporal) 
methods to use this temporal information? We discuss these questions next. 


6.2.2 Community Detection in Evolving Networks 

Consider an instant messaging (IM) application in social media. In these 
IM systems, members become “available” or “offline” frequently. Consider 
individuals as nodes and messages between them as edges. In this example, 
we are interested in finding a community of individuals who send messages 
to one another frequently. Clearly, community detection at any time stamp 
is not a valid solution because interactions are limited at any point in time.
A valid solution to this problem needs to use temporal information and
interactions between users over time. In this section we present community 
detection algorithms that incorporate temporal information. To incorporate 
temporal information, we can extend previously discussed static methods 
as follows: 

1. Take t snapshots of the network, $G_1, G_2, \ldots, G_t$, where $G_i$ is a
snapshot at time i.

2. Perform a static community detection algorithm on all snapshots
independently.

3. Assign community members based on communities found in all t
different time stamps. For instance, we can assign nodes to communities
based on voting. In voting, we assign each node to the community
it belongs to most often over time.
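The voting step can be sketched as follows. This is a minimal illustration under one strong assumption that the text does not address: community labels are already aligned across snapshots, so that label 1 at one time stamp denotes the same community as label 1 at another. The function name and the toy snapshots are hypothetical.

```python
from collections import Counter


def vote_memberships(snapshot_assignments):
    """Assign each node to the community label it receives in the most snapshots.

    snapshot_assignments: list of dicts (node -> community label), one per snapshot.
    Assumes labels are comparable across snapshots.
    """
    ballots = {}
    for assignment in snapshot_assignments:
        for node, community in assignment.items():
            ballots.setdefault(node, Counter())[community] += 1
    # most_common(1) picks the most frequent label; ties go to the label seen first.
    return {node: counts.most_common(1)[0][0] for node, counts in ballots.items()}


# Toy snapshots (illustrative only); node "c" is offline in the last one.
snapshots = [
    {"a": 1, "b": 1, "c": 2},
    {"a": 1, "b": 2, "c": 2},
    {"a": 1, "b": 2},
]
print(vote_memberships(snapshots))
```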

Unfortunately, this method is unstable in highly dynamic networks 
because community memberships are always changing. An alternative is to 
use evolutionary clustering. 


Evolutionary Clustering 

In evolutionary clustering, it is assumed that communities do not change
most of the time; hence, the method minimizes an objective function that
considers both the communities at different time stamps (snapshot cost or
SC) and how they evolve throughout time (temporal cost or TC). The
objective function for evolutionary clustering is then defined as a linear
combination of the snapshot cost and temporal cost (SC and TC),

$$Cost = \alpha \, SC + (1 - \alpha) \, TC, \tag{6.27}$$

where $0 \le \alpha \le 1$. Let us assume that spectral clustering (discussed in
Section 6.1.3) is used to find communities at each time stamp. We know that
the objective for spectral clustering is $Tr(X^{T} L X)$ s.t. $X^{T} X = I$, so we
will have the objective function at time t as

$$Cost_t = \alpha \, SC + (1 - \alpha) \, TC, \tag{6.28}$$

$$= \alpha \, Tr(X_t^{T} L X_t) + (1 - \alpha) \, TC, \tag{6.29}$$

where $X_t$ is the community membership matrix at time t. To define TC,
we can compute the distance between the community assignments of
two snapshots:

$$TC = \|X_t - X_{t-1}\|^2. \tag{6.30}$$

Unfortunately, this requires both $X_t$ and $X_{t-1}$ to have the same number
of columns (number of communities). Moreover, $X_t$ is not unique and
can change by orthogonal transformations; therefore, the distance value
$\|X_t - X_{t-1}\|^2$ can change arbitrarily. To remove the effect of orthogonal
transformations and allow different numbers of columns, TC is defined as

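$$TC = \tfrac{1}{2} \|X_t X_t^{T} - X_{t-1} X_{t-1}^{T}\|^2,$$

$$= \tfrac{1}{2} Tr\big((X_t X_t^{T} - X_{t-1} X_{t-1}^{T})^{T} (X_t X_t^{T} - X_{t-1} X_{t-1}^{T})\big),$$

$$= \tfrac{1}{2} Tr\big(X_t X_t^{T} X_t X_t^{T} - 2 X_t X_t^{T} X_{t-1} X_{t-1}^{T} + X_{t-1} X_{t-1}^{T} X_{t-1} X_{t-1}^{T}\big),$$

$$= Tr\big(I - X_t X_t^{T} X_{t-1} X_{t-1}^{T}\big),$$

$$= Tr\big(I - X_t^{T} X_{t-1} X_{t-1}^{T} X_t\big), \tag{6.31}$$

where the factor $\tfrac{1}{2}$ is for mathematical convenience, and $Tr(AB) = Tr(BA)$ is used.
Therefore, the evolutionary clustering objective can be stated as

$$Cost_t = \alpha \, Tr(X_t^{T} L X_t) + (1 - \alpha) \, \tfrac{1}{2} \|X_t X_t^{T} - X_{t-1} X_{t-1}^{T}\|^2,$$

$$= \alpha \, Tr(X_t^{T} L X_t) + (1 - \alpha) \, Tr\big(I - X_t^{T} X_{t-1} X_{t-1}^{T} X_t\big),$$

$$= \alpha \, Tr(X_t^{T} L X_t) + (1 - \alpha) \, Tr\big(X_t^{T} I X_t - X_t^{T} X_{t-1} X_{t-1}^{T} X_t\big),$$

$$= Tr(X_t^{T} \alpha L X_t) + Tr\big(X_t^{T} (1 - \alpha) I X_t - X_t^{T} (1 - \alpha) X_{t-1} X_{t-1}^{T} X_t\big). \tag{6.32}$$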

Assuming the normalized Laplacian is used in spectral clustering,
$L = I - D_t^{-1/2} A_t D_t^{-1/2}$,

$$Cost_t = Tr\big(X_t^{T} \alpha (I - D_t^{-1/2} A_t D_t^{-1/2}) X_t\big) + Tr\big(X_t^{T} (1 - \alpha) I X_t - X_t^{T} (1 - \alpha) X_{t-1} X_{t-1}^{T} X_t\big),$$

$$= Tr\big(X_t^{T} (I - \alpha D_t^{-1/2} A_t D_t^{-1/2} - (1 - \alpha) X_{t-1} X_{t-1}^{T}) X_t\big),$$

$$= Tr(X_t^{T} \hat{L} X_t), \tag{6.33}$$

where $\hat{L} = I - \alpha D_t^{-1/2} A_t D_t^{-1/2} - (1 - \alpha) X_{t-1} X_{t-1}^{T}$. Similar to spectral
clustering, $X_t$ can be obtained by taking the top eigenvectors of $\hat{L}$.

Note that at time t, we can obtain $X_t$ directly by solving spectral clustering
for the Laplacian of the graph at time t, but then we are not employing
any temporal information. Using evolutionary clustering and the new Laplacian
$\hat{L}$, we incorporate temporal information into our community detection
algorithm and prevent user memberships in communities at time t, $X_t$, from
changing dramatically from $X_{t-1}$.

Figure 6.15. Community Evaluation Example. Circles represent communities, and
items inside the circles represent members. Each item is represented using a symbol, +,
×, or △, that denotes the item's true label.
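The construction of the modified Laplacian can be sketched with NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names and the toy graph are hypothetical, every node is assumed to have at least one edge, and "top eigenvectors" is read here as the eigenvectors with the smallest eigenvalues, because the trace cost is being minimized. The final step of turning the relaxed matrix into discrete memberships (for example, by clustering its rows) is not shown.

```python
import numpy as np


def evolutionary_laplacian(adjacency_t, x_prev, alpha):
    """Build I - alpha * D^{-1/2} A_t D^{-1/2} - (1 - alpha) * X_{t-1} X_{t-1}^T."""
    a_t = np.asarray(adjacency_t, dtype=float)
    degrees = a_t.sum(axis=1)
    # Assumption: no isolated nodes, so every degree is positive and D^{-1/2} exists.
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    normalized_adjacency = d_inv_sqrt[:, None] * a_t * d_inv_sqrt[None, :]
    identity = np.eye(a_t.shape[0])
    return identity - alpha * normalized_adjacency - (1.0 - alpha) * (x_prev @ x_prev.T)


def relaxed_memberships(l_hat, k):
    """Return k orthonormal eigenvectors of l_hat with the smallest eigenvalues."""
    # np.linalg.eigh is for symmetric matrices and returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(l_hat)
    return eigenvectors[:, :k]


def two_triangles(bridges):
    """Toy graph: triangles {0,1,2} and {3,4,5}, joined by the given extra edges."""
    a = np.zeros((6, 6))
    for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)] + list(bridges):
        a[u, v] = a[v, u] = 1.0
    return a


k = 2
a_prev = two_triangles([(2, 3)])
a_now = two_triangles([(2, 3), (1, 4)])

# First snapshot: alpha = 1 drops the temporal term, i.e., plain spectral clustering.
x_prev = relaxed_memberships(evolutionary_laplacian(a_prev, np.zeros((6, k)), 1.0), k)
# Second snapshot: alpha = 0.8 mixes the current graph with the previous memberships.
x_now = relaxed_memberships(evolutionary_laplacian(a_now, x_prev, 0.8), k)
print(x_now.shape)
```

Smaller values of alpha weight the previous memberships more heavily, so the result changes more slowly from one snapshot to the next.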


6.3 Community Evaluation 

When communities are found, one must evaluate how accurately the detection
task has been performed. In terms of evaluating communities, the task
is similar to evaluating clustering methods in data mining. Evaluating
clustering is a challenge because ground truth may not be available. We consider
two scenarios: when ground truth is available and when it is not.


6.3.1 Evaluation with Ground Truth 

When ground truth is available, we have at least partial knowledge of what 
communities should look like. Here, we assume that we are given the correct 
community (clustering) assignments. We discuss four measures: precision 
and recall, F-measure, purity, and normalized mutual information (NMI). 
Consider Figure 6.15, where three communities are found and the points 
are shown using their true labels. 


Precision and Recall 

Community detection can be considered a problem of assigning all similar
nodes to the same community. In the simplest case, any two similar nodes
should be considered members of the same community. Based on our
assignments, four cases can occur:

1. True Positive (TP) Assignment: when similar members are assigned 
to the same community. This is a correct decision. 

2. True Negative (TN) Assignment: when dissimilar members are 
assigned to different communities. This is a correct decision. 

3. False Negative (FN) Assignment: when similar members are assigned 
to different communities. This is an incorrect decision. 

4. False Positive (FP) Assignment: when dissimilar members are 
assigned to the same community. This is an incorrect decision. 


Precision (P) and Recall (R) are defined as follows,

$$P = \frac{TP}{TP + FP}, \tag{6.34}$$

$$R = \frac{TP}{TP + FN}. \tag{6.35}$$

Precision is the fraction of pairs assigned to the same community that
truly belong together. Recall is the fraction of pairs that should have
been in the same community that the community detection algorithm actually
assigned to the same community.


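Example 6.5. We compute these values for Figure 6.15. For TP, we need
to compute the number of pairs with the same label that are in the same
community. For instance, for label × and community 1, we have $\binom{5}{2}$ such
pairs. Therefore,

$$TP = \underbrace{\binom{5}{2}}_{\text{Community 1}} + \underbrace{\binom{6}{2}}_{\text{Community 2}} + \underbrace{\left(\binom{4}{2} + \binom{2}{2}\right)}_{\text{Community 3}} = 32. \tag{6.36}$$

For FP, we need to compute dissimilar pairs that are in the same community.
For instance, for community 1, this is (5 × 1 + 5 × 1 + 1 × 1).
Therefore,

$$FP = \underbrace{(5 \times 1 + 5 \times 1 + 1 \times 1)}_{\text{Community 1}} + \underbrace{(6 \times 1)}_{\text{Community 2}} + \underbrace{(4 \times 2)}_{\text{Community 3}} = 25. \tag{6.37}$$

FN computes similar members that are in different communities. For
instance, for label +, this is (6 × 1 + 6 × 2 + 2 × 1). Similarly,

$$FN = \underbrace{(5 \times 1)}_{\times} + \underbrace{(6 \times 1 + 6 \times 2 + 2 \times 1)}_{+} + \underbrace{(4 \times 1)}_{\triangle} = 29. \tag{6.38}$$

Finally, TN computes the number of dissimilar pairs in dissimilar communities:

$$TN = \underbrace{(5 \times 6 + 1 \times 1 + 1 \times 6 + 1 \times 1)}_{\text{Communities 1 and 2}} + \underbrace{(5 \times 4 + 5 \times 2 + 1 \times 4 + 1 \times 2)}_{\text{Communities 1 and 3}} + \underbrace{(6 \times 4 + 1 \times 2 + 1 \times 4)}_{\text{Communities 2 and 3}} = 104. \tag{6.39}$$

Within each group, the terms pair the labels (×,+), (+,×), (△,+), (△,×) for
communities 1 and 2; (×,△), (×,+), (+,△), (△,+) for communities 1 and 3; and
(+,△), (×,+), (×,△) for communities 2 and 3.

Hence,

$$P = \frac{32}{32 + 25} = 0.56, \tag{6.40}$$

$$R = \frac{32}{32 + 29} = 0.52. \tag{6.41}$$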


F-Measure 


To consolidate precision and recall into one measure, we can use the harmonic
mean of precision and recall:

$$F = 2 \cdot \frac{P \cdot R}{P + R}. \tag{6.42}$$

Computed for the same example, we get F = 0.54.
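The pair counts of Example 6.5 can be checked mechanically. The sketch below enumerates all pairs of nodes and classifies each as TP, FP, FN, or TN. The community contents are inferred from the counts given in the example (community 1: five ×, one +, one △; community 2: six +, one ×; community 3: four △, two +); the function names and the label strings are choices made for this illustration.

```python
from itertools import combinations


def pair_counts(communities, labels):
    """Count TP, FP, FN, TN over all unordered pairs of nodes."""
    tp = fp = fn = tn = 0
    for u, v in combinations(range(len(labels)), 2):
        same_community = communities[u] == communities[v]
        same_label = labels[u] == labels[v]
        if same_community and same_label:
            tp += 1
        elif same_community:
            fp += 1
        elif same_label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f(tp, fp, fn):
    """Precision, recall, and their harmonic mean (no guard against zero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)


# "x", "+", and "t" stand for the three true labels of Figure 6.15.
communities = [1] * 7 + [2] * 7 + [3] * 6
labels = ["x"] * 5 + ["+", "t"] + ["+"] * 6 + ["x"] + ["t"] * 4 + ["+"] * 2

tp, fp, fn, tn = pair_counts(communities, labels)
print(tp, fp, fn, tn)  # 32 25 29 104
print([round(value, 2) for value in precision_recall_f(tp, fp, fn)])  # [0.56, 0.52, 0.54]
```

Enumerating pairs is quadratic in the number of nodes, which is fine for small examples; larger networks would call for counting from a contingency table instead.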


Purity 

In purity, we assume that the majority of a community represents the community.
Hence, we use the label of the majority of a community against the
label of each member of the community to evaluate the algorithm. For
instance, in Figure 6.15, the majority in Community 1 is ×; therefore, we
assume majority label × for that community. The purity is then defined as
the fraction of instances that have labels equal to their community’s majority
label. Formally,

$$Purity = \frac{1}{N} \sum_{i=1}^{k} \max_{j} |C_i \cap L_j|, \tag{6.43}$$

where k is the number of communities, N is the total number of nodes,
$L_j$ is the set of instances with label j in all communities, and $C_i$ is the
set of members in community i. In the case of our example, purity is
$\frac{5 + 6 + 4}{20} = 0.75$.
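The purity computation can be sketched as a majority count per community. As before, the community contents are inferred from the counts in Example 6.5, and the function name and label strings are assumptions of this illustration.

```python
from collections import Counter


def purity(found_communities):
    """found_communities: one list of true labels per detected community."""
    total_nodes = sum(len(members) for members in found_communities)
    # For each community, count the members that carry its majority label.
    majority_matches = sum(
        Counter(members).most_common(1)[0][1] for members in found_communities
    )
    return majority_matches / total_nodes


# "x", "+", and "t" stand for the three true labels of Figure 6.15.
figure_6_15 = [
    ["x"] * 5 + ["+", "t"],
    ["+"] * 6 + ["x"],
    ["t"] * 4 + ["+"] * 2,
]
print(purity(figure_6_15))  # 0.75
```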


Normalized Mutual Information 


Purity can be easily manipulated to generate high values; consider when
nodes represent singleton communities (of size 1) or when we have very
large pure communities (ground truth = majority label). In both cases,
purity is uninformative because it is high regardless of how meaningful
the communities are.

A more precise measure to solve problems associated with purity is the
normalized mutual information (NMI) measure, which originates in information
theory. Mutual information (MI) describes the amount of information
that two random variables share. In other words, by knowing one of
the variables, MI measures the amount of uncertainty reduced regarding
the other variable. Consider the case of two independent variables; in this
case, the mutual information is zero because knowing one does not help
in knowing more information about the other. Mutual information of two
variables X and Y is denoted as I(X, Y). We can use mutual information to
measure the information one clustering carries regarding the ground truth.
It can be calculated using Equation 6.44, where L and H are labels and
found communities; $n_h$ and $n_l$ are the number of data points in community
h and with label l, respectively; $n_{h,l}$ is the number of nodes in community
h and with label l; and n is the number of nodes.

$$MI = I(X, Y) = \sum_{h \in H} \sum_{l \in L} \frac{n_{h,l}}{n} \log \frac{n \cdot n_{h,l}}{n_h n_l}. \tag{6.44}$$


Unfortunately, mutual information is unbounded; however, it is common
for measures to have values in range [0,1]. To address this issue, we can
normalize mutual information. We provide the following inequality, without
proof, which will help us normalize mutual information,

$$MI \le \min(H(L), H(H)), \tag{6.45}$$

where $H(\cdot)$ is the entropy function,

$$H(L) = -\sum_{l \in L} \frac{n_l}{n} \log \frac{n_l}{n}, \tag{6.46}$$

$$H(H) = -\sum_{h \in H} \frac{n_h}{n} \log \frac{n_h}{n}. \tag{6.47}$$

From Equation 6.45, we have $MI \le H(L)$ and $MI \le H(H)$; therefore,

$$(MI)^2 \le H(H) H(L). \tag{6.48}$$

Equivalently,

$$MI \le \sqrt{H(H)} \sqrt{H(L)}. \tag{6.49}$$

Equation 6.49 can be used to normalize mutual information. Thus, we
introduce the NMI as

$$NMI = \frac{MI}{\sqrt{H(L)} \sqrt{H(H)}}. \tag{6.50}$$

By plugging Equations 6.47, 6.46, and 6.44 into 6.50,

$$NMI = \frac{\sum_{h \in H} \sum_{l \in L} n_{h,l} \log \frac{n \cdot n_{h,l}}{n_h n_l}}{\sqrt{\left(\sum_{h \in H} n_h \log \frac{n_h}{n}\right) \left(\sum_{l \in L} n_l \log \frac{n_l}{n}\right)}}. \tag{6.51}$$


An NMI value close to one indicates high similarity between commu¬ 
nities found and labels. A value close to zero indicates a long distance 
between them. 
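The NMI definition can be sketched directly from the counts it uses. The example below is a minimal illustration on toy data that is not taken from the text; the function name is hypothetical, and raising an error when either entropy is zero is a choice made for this sketch, since the normalization is undefined in that case.

```python
import math
from collections import Counter


def normalized_mutual_information(labels, communities):
    """NMI between true labels and found communities (MI divided by sqrt of the entropies)."""
    n = len(labels)
    n_l = Counter(labels)                      # nodes per label
    n_h = Counter(communities)                 # nodes per found community
    n_hl = Counter(zip(communities, labels))   # nodes per (community, label) pair

    mutual_information = sum(
        count / n * math.log(n * count / (n_h[h] * n_l[l]))
        for (h, l), count in n_hl.items()
    )
    entropy_labels = -sum(c / n * math.log(c / n) for c in n_l.values())
    entropy_communities = -sum(c / n * math.log(c / n) for c in n_h.values())
    if entropy_labels == 0 or entropy_communities == 0:
        # Choice made for this sketch: refuse to normalize by a zero entropy.
        raise ValueError("NMI is undefined when an entropy is zero")
    return mutual_information / math.sqrt(entropy_labels * entropy_communities)


truth = ["a", "a", "b", "b"]
print(normalized_mutual_information(truth, [1, 1, 2, 2]))  # communities match the labels
print(normalized_mutual_information(truth, [1, 2, 1, 2]))  # communities independent of labels
```

Only nonzero cells of the contingency table contribute to the sum, which matches the convention that a term with $n_{h,l} = 0$ is taken to be zero.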


6.3.2 Evaluation without Ground Truth 

When no ground truth is available, we can incorporate techniques based on
semantics or clustering quality measures to evaluate community detection
algorithms.

Figure 6.16. Tag Clouds for Two Communities.


Evaluation with Semantics 

A simple way of analyzing detected communities is to analyze other
attributes (posts, profile information, content generated, etc.) of community
members to see if there is coherency among community members. The
coherency is often checked via human subjects. For example, the Amazon
Mechanical Turk platform allows one to define this task for human workers
and to hire individuals from all around the globe to perform
tasks such as community evaluation. To help analyze these communities,
one can use word frequencies. By generating a list of frequent keywords for
each community, human subjects determine whether these keywords represent
a coherent topic. A more focused and single-topic set of keywords
represents a coherent community. Tag clouds are one way of demonstrating
these topics. Figure 6.16 depicts two coherent tag clouds for a community
related to the U.S. Constitution and another for sports. Larger words in these
tag clouds represent higher frequency of use.
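Generating the keyword list for one community can be sketched with a simple frequency count. This is only an illustration: the posts are made up, the stopword list is a small hypothetical one, and whitespace tokenization is a simplification that a real study would likely replace.

```python
from collections import Counter

# Hypothetical stopword list; a real study would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})


def frequent_keywords(posts, top_n):
    """Return the top_n most frequent non-stopword tokens across a community's posts."""
    counts = Counter(
        token
        for post in posts
        for token in post.lower().split()
        if token not in STOPWORDS
    )
    return counts.most_common(top_n)


# Made-up posts for one detected community (illustrative only).
community_posts = [
    "the senate debates an amendment",
    "amendment to the constitution",
    "constitution and the senate",
]
print(frequent_keywords(community_posts, 3))
```

The resulting counts are what a tag cloud would render as font sizes, and what human subjects would inspect for topical coherence.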


Evaluation Using Clustering Quality Measures 

When experts are not available, an alternative is to use clustering quality
measures. This approach is commonly used when two or more community
detection algorithms are available. Each algorithm is run on the target
network, and the quality measure is computed for the identified communities.
The algorithm that yields a more desirable quality measure value is
considered a better algorithm. SSE (sum of squared errors) and inter-cluster
distance are some of the quality measures. For other measures refer to
Chapter 5.
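Comparing two algorithms by SSE can be sketched as follows. The sketch assumes each node comes with a numeric feature vector, which the text does not specify; the function name, the features, and the two assignments are hypothetical.

```python
def sum_of_squared_errors(points, assignment):
    """Sum of squared distances from each node's vector to its community centroid.

    points: dict node -> tuple of floats; assignment: dict node -> community label.
    """
    groups = {}
    for node, community in assignment.items():
        groups.setdefault(community, []).append(points[node])
    total = 0.0
    for members in groups.values():
        dimensions = range(len(members[0]))
        centroid = [sum(p[d] for p in members) / len(members) for d in dimensions]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in dimensions)
    return total


# Made-up node features and two candidate outputs (illustrative only).
features = {"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (10.0, 0.0), "d": (12.0, 0.0)}
algorithm_1 = {"a": 0, "b": 0, "c": 1, "d": 1}
algorithm_2 = {"a": 0, "c": 0, "b": 1, "d": 1}
print(sum_of_squared_errors(features, algorithm_1))  # 4.0
print(sum_of_squared_errors(features, algorithm_2))  # 100.0
```

Under this measure the first assignment would be preferred, since a lower SSE means members lie closer to their community centroid.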

We can also follow this approach for evaluating a single community
detection algorithm; however, we must ensure that the clustering quality
measure used to evaluate community detection is different from the measure
used to find communities. For instance, when using node similarity to group
individuals, a measure other than node similarity should be used to evaluate
the effectiveness of clustering.

6.4 Summary 

In this chapter, we discussed community analysis in social media, answering 
three general questions: (1) how can we detect communities, (2) how do
communities evolve and how can we study evolving communities, and (3) 
how can we evaluate detected communities? We started with a description 
of communities and how they are formed. Communities in social media are 
either explicit (emic) or implicit (etic). Community detection finds implicit 
communities in social media. 

We reviewed member-based and group-based community detection algorithms.
In member-based community detection, members can be grouped
based on their degree, reachability, and similarity. For example, when using
degrees, cliques are often considered as communities. Brute-force clique
identification is used to identify cliques. In practice, due to the computational
complexity of clique identification, cliques are either relaxed or used
as seeds of communities. k-Plex is an example of relaxed cliques, and the
clique percolation algorithm is an example of methods that use cliques as
community seeds. When performing member-based community detection
based on reachability, three frequently used subgraphs are the k-clique,
k-club, and k-clan. Finally, in member-based community detection based on
node similarity, methods such as Jaccard and Cosine similarity help compute
node similarity. In group-based community detection, we described
methods that find balanced, robust, modular, dense, or hierarchical communities.
When finding balanced communities, one can employ spectral
clustering. Spectral clustering provides a relaxed solution to the normalized
cut and ratio cuts in graphs. For finding robust communities, we search
for subgraphs that are hard to disconnect. k-edge and k-vertex graphs are
two examples of these robust subgraphs. To find modular communities, one
can use modularity maximization and for dense communities, we discussed
quasi-cliques. Finally, we provided hierarchical clustering as a solution to
finding hierarchical communities, with the Girvan-Newman algorithm as
an example.

In community evolution, we discussed how networks and, on a lower
level, communities evolve. We also discussed how communities can be
detected in evolving networks using evolutionary clustering. Finally, we
presented how communities are evaluated when ground truth exists and
when it does not.


6.5 Bibliographic Notes 

A general survey of community detection in social media can be found in
[Fortunato, 2009] and a review of heterogeneous community detection in
[Tang and Liu, 2010]. In related fields, [Berkhin, 2006; Xu et al., 2005;
Jain et al., 1999] provide surveys of clustering algorithms and [Wasserman
and Faust, 1994] provides a sociological perspective. Comparative analysis
of community detection algorithms can be found in [Lancichinetti and
Fortunato, 2009] and [Leskovec et al., 2010]. The description of explicit
communities in this chapter is due to Kadushin [2012].

For member-based algorithms based on node degree, refer to [Kumar
et al., 1999], which provides a systematic approach to finding clique-based
communities with pruning. In algorithms based on node reachability, one
can find communities by finding connected components in the network.
For more information on finding connected components of a graph refer
to [Hopcroft and Tarjan, 1971]. In node similarity, we discussed structural
equivalence, similarity measures, and regular equivalence. More information
on structural equivalence can be found in [Lorrain and White, 1971;
Leicht et al., 2005], on Jaccard similarity in [Jaccard, 1901], and on regular
equivalence in [Stephen and Martin, 1993].

In group-based methods that find balanced communities, we are often
interested in solving the max-flow min-cut problem. Linear programming
and Ford-Fulkerson [Cormen, 2009], Edmonds-Karp [Edmonds and Karp,
1972], and Push-Relabel [Goldberg and Tarjan, 1988] methods are some
established techniques for solving the max-flow min-cut problem. We discussed
quasi-cliques that help find dense communities. Finding the maximum
quasi-clique is discussed in [Pattillo et al., 2012]. A well-known
greedy algorithm for finding quasi-cliques is introduced by [Abello et al.,
2002]. In their approach a local search with a pruning strategy is performed
on the graph to enhance the speed of quasi-clique detection. They define a
peel strategy, in which vertices that have some degree k along with their
incident edges are recursively removed. There are a variety of algorithms
to find dense subgraphs, such as the one discussed in [Gibson et al., 2005]
where the authors propose an algorithm that recursively fingerprints the
graph (shingling algorithm) and creates dense subgraphs. In group-based
methods that find hierarchical communities, we described hierarchical
clustering. Hierarchical clustering algorithms are usually variants of single
link, average link, or complete link algorithms [Jain and Dubes, 1999].
In hierarchical clustering, COBWEB [Fisher, 1987] and CHAMELEON
[Karypis et al., 1999] are two well-known algorithms.

In group-based community detection, latent space models [Handcock
et al., 2007; Hoff et al., 2002] are also very popular, but are not discussed in
this chapter. In addition to the topics discussed in this chapter, community
detection can also be performed for networks with multiple types of interaction
(edges) [Tang and Liu, 2009; Tang et al., 2012]. We also restricted
our discussion to community detection algorithms that use graph information.
One can also perform community detection based on the content
that individuals share on social media. For instance, using tagging relations
(i.e., individuals who shared the same tag) [Wang et al., 2010], instead
of connections between users, one can discover overlapping communities,
which provides a natural summarization of the interests of the identified
communities.

In network evolution analysis, network segmentation is discussed in 
[Kumar et al., 2010]. Segment-based clustering [Sun et al., 2007] is another 
method not covered in this chapter. 

NMI was first introduced in [Strehl et al., 2002] and in terms of clustering
quality measures, the Davies-Bouldin [Davies and Bouldin, 1979]
measure, Rand index [Rand, 1971], C-index [Dunn, 1974], Silhouette index
[Rousseeuw, 1987], and Goodman-Kruskal index [Goodman and Kruskal,
1954] can be used.


6.6 Exercises 

1. Provide an example to illustrate how community detection can be
subjective.


Community Detection 

2. Given a complete graph $K_n$, how many nodes will the clique percolation
method generate for the clique graph for value k? How many edges
will it generate?

3. Find all k-cliques, k-clubs, and k-clans in a complete graph of size 4.

4. For a complete graph of size n, is it m-connected? What possible values
can m take?

5. Why is the smallest eigenvector meaningless when using an unnormalized
Laplacian matrix?

6. Modularity can be defined as

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{d_i d_j}{2m} \right] \delta(c_i, c_j), \tag{6.52}$$

where $c_i$ and $c_j$ are the communities for $v_i$ and $v_j$, respectively.
$\delta(c_i, c_j)$ (Kronecker delta) is 1 when $v_i$ and $v_j$ both belong to the
same community ($c_i = c_j$), and 0 otherwise.

• What is the range $[\alpha_1, \alpha_2]$ for Q values? Provide examples for both
extreme values of the range and cases where modularity becomes
zero.

• What are the limitations for modularity? Provide an example where
modularity maximization does not seem reasonable.

• Find three communities in Figure 6.8 by performing modularity
maximization.

7. For Figure 6.8: 

• Compute Jaccard and Cosine similarity between nodes $v_4$ and
$v_8$, assuming that the neighborhood of a node excludes the node
itself.

• Compute Jaccard and Cosine similarity when the node is included in
the neighborhood.


Community Evolution 


8. What is the upper bound on densification factor $\alpha$? Explain.


Community Evaluation 


9. Normalized mutual information (NMI) is used to evaluate community
detection results when the actual communities (labels) are known
beforehand.

• What are the maximum and minimum values for the NMI? Provide 
details. 

• Explain how NMI works (describe the intuition behind it). 

10. Compute NMI for Figure 6.15. 

11. Why is high precision not enough? Provide an example to show that 
both precision and recall are important. 

12. Discuss situations where purity does not make sense.

13. Compute the following for Figure 6.17:

• precision and recall

• F-measure

• NMI

• purity

Figure 6.17. Community Evaluation Example.


7 

Information Diffusion in Social Media 


In February 2013, during the third quarter of Super Bowl XLVII, a power 
outage stopped the game for 34 minutes. Oreo, a sandwich cookie company, 
tweeted during the outage: “Power out? No Problem, You can still dunk it in 
the dark.” The tweet caught on almost immediately, reaching nearly 15,000 
retweets and 20,000 likes on Facebook in less than two days. A simple 
tweet diffused into a large population of individuals. It helped the company 
gain fame with minimum cost in an environment where companies spent as 
much as $4 million to run a 30-second ad. This is an example of information 
diffusion. 

Information diffusion is a field encompassing techniques from a plethora
of sciences. In this chapter, we discuss methods from fields such as sociology,
epidemiology, and ethnography, which can help in social media mining.
Our focus is on techniques that can model information diffusion.

Societies provide means for individuals to exchange information through various
channels. For instance, people share knowledge with their immediate network
(friends) or broadcast it via public media (TV, newspapers, etc.)
throughout the society. Given this flow of information, different research
fields have disparate views of what is an information diffusion process. We
define information diffusion as the process by which a piece of information
(knowledge) is spread and reaches individuals through interactions. The
diffusion process involves the following three elements:

1. Sender(s). A sender or a small set of senders initiate the information 
diffusion process. 

2. Receiver(s). A receiver or a set of receivers receive diffused information.
Commonly, the set of receivers is much larger than the set of
senders and can overlap with the set of senders.

3. Medium. This is the medium through which the diffusion takes 
place. For example, when a rumor is spreading, the medium can be 
the personal communication between individuals.

This definition can be generalized to other domains. In a disease-spreading
process, the disease is the analog to the information, and infection
can be considered a diffusing process. The medium in this case is the air
shared by the infecter and the infectee. Information diffusion can be
interrupted. We define the process of interfering with information diffusion
by expediting, delaying, or even stopping diffusion as intervention.

Individuals in online social networks are situated in a network where
they interact with others. Although this network is at times unavailable or
unobservable, the information diffusion process takes place in it. Individuals
facilitate information diffusion by making individual decisions that
allow information to flow. For instance, when a rumor is spreading, individuals
decide if they are interested in spreading it to their neighbors. They can
make this decision either dependently (i.e., depending on the information
they receive from others) or independently. When they make dependent
decisions, it is important to gauge the level of dependence that individuals
have on others. It could be local dependence, where an individual’s
decision is dependent on all of his or her immediate neighbors (friends)
or global dependence, where all individuals in the network are observed
before making decisions.

In this chapter, we present in detail four general types of information 
diffusion: herd behavior, information cascades, diffusion of innovation, and 
epidemics. 

Herd behavior takes place when individuals observe the actions of all
others and act in an aligned form with them. An information cascade
describes the process of diffusion when individuals merely observe their
immediate neighbors. In information cascades and herd behavior, the network
of individuals is observable; however, in herding, individuals decide
based on global information (global dependence); whereas, in information
cascades, decisions are made based on knowledge of immediate neighbors
(local dependence).

Diffusion of innovations provides a bird’s-eye view of how an innovation
(e.g., a product, music video, or fad) spreads through a population. It
assumes that interactions among individuals are unobservable and that the
sole available information is the rate at which products are being adopted
throughout a certain period of time. This information is particularly interesting
for companies performing market research, where the sole available
information is the rate at which their products are being bought. These companies
have no access to interactions among individuals. Epidemic models
are similar to diffusion of innovations models, with the difference that the
innovation’s analog is a pathogen and adoption is replaced by infection.
Another difference is that in epidemic models, individuals do not decide
whether to become infected or not and infection is considered a random natural
process, as long as the individual is exposed to the pathogen. Figure 7.1
summarizes our discussion by providing a decision tree of the information
diffusion types.


7.1 Herd Behavior 

Consider people participating in an online auction. Individuals are connected
via the auction’s site where they can not only observe the bidding
behaviors of others but can also often view profiles of others to get a feel
for their reputation and expertise. Individuals often participate actively in
online auctions, even bidding on items that might otherwise be considered
unpopular. This is because they trust others and assume that the high number
of bids that the item has received is a strong signal of its value. In this
case, herd behavior has taken place.

Herd behavior, a term first coined by British surgeon Wilfred Trotter
[1916], describes when a group of individuals performs actions that are 
aligned without previous planning. It has been observed in flocks, herds 
of animals, and in humans during sporting events, demonstrations, and 
religious gatherings, to name a few examples. In general, any herd behavior 
requires two components: 

1. connections between individuals 

2. a method to transfer behavior among individuals or to observe their
behavior

Figure 7.2. Solomon Asch Experiment. Participants were asked to match the line on the
left card to the line on the right card that has the exact same length.

Individuals can also make decisions that are aligned with others (mindless
decisions) when they conform to social or peer pressure. A well-known
example is the set of experiments performed by Solomon Asch during the 
1950s [Asch, 1956]. In one experiment, he asked groups of students to 
participate in a vision test where they were shown two cards (Figure 7.2), 
one with a single line segment and one with three lines, and told to match 
the line segments with the same length. 

Each participant was put into a group where all the other group members 
were actually collaborators with Asch, although they were introduced as 
participants to the subject. Asch found that in control groups with no 
pressure to conform, in which the collaborators gave the correct answer, 
only 3% of the subjects provided an incorrect answer. However, when 
participants were surrounded by individuals providing an incorrect answer, 
up to 32% of the responses were incorrect. 

In contrast to this experiment, we refer to the process in which individuals 
consciously make decisions aligned with others by observing the decisions 
of other individuals as herding or herd behavior. In theory, there is no need 
to have a network of people. In practice, there is a network, and this network 
is close to a complete graph, where nodes can observe at least most other 
nodes. Consider this example of herd behavior. 


Example 7.1. Diners Example [Banerjee, 1992]. Assume you are visiting
a metropolitan area that you are not familiar with. Planning for dinner,
you find restaurant A with excellent reviews online and decide to go there.
When arriving at A, you see that A is almost empty and that restaurant B,
which is next door and serves the same cuisine, is almost full. Deciding to
go to B, based on the belief that other diners have also had the chance of
going to A, is an example of herd behavior.

In this example, when B is getting more and more crowded, herding 
is taking place. Herding happens because we consider crowd intelligence 
trustworthy. We assume that there must be private information not known to 
us, but known to the crowd, that resulted in the crowd preferring restaurant 
B over A. In other words, we assume that, given this private information, 
we would have also chosen B over A. 

In general, when designing a herding experiment, the following four 
conditions need to be satisfied: 

1. There needs to be a decision made. In this example, the decision 
involves going to a restaurant. 

2. Decisions need to be in sequential order. 

3. Decisions are not mindless, and people have private information that 
helps them decide. 

4. No message passing is possible. Individuals do not know the private 
information of others, but can infer what others know from what they 
observe from their behavior. 

Anderson and Holt [1996, 1997] designed an experiment satisfying these
four conditions, in which students guess whether an urn containing red and
blue marbles is majority red or majority blue. Each student had access to
the guesses of the students who went before. Anderson and Holt observed
herd behavior, in which students reached a consensus regarding the majority
color over time. It has been shown [Easley and Kleinberg, 2010] that Bayesian
modeling is an effective technique for demonstrating why this herd behavior
occurs. Simply put, computing conditional probabilities and selecting the
most probable majority color results in herding over time. Next, we detail
this experiment and how conditional probabilities can explain why herding
takes place.

7.1.1 Bayesian Modeling of Herd Behavior 

In this section, we show how Bayesian modeling can be used to explain herd 
behavior by describing in detail the urn experiment devised by Anderson 
and Holt [1996, 1997]. In front of a large class of students, there is an urn 
that has three marbles in it. These marbles are either blue (B) or red (R), 
and we are guaranteed to have at least one of each color. So, the urn is either 
majority blue (B,B,R) or majority red (R,R,B). We assume the probability
of being either majority blue or majority red is 50%. During the experiment,
each student comes to the urn, picks one marble, and checks its color in 
private. The student predicts majority blue or red, writes the prediction 
on the blackboard (which was blank initially), and puts the marble back 
in the urn. Other students cannot see the color of the marble taken out, 
but can see the predictions made by the students regarding the majority 
color and written on the blackboard. Let the BOARD variable denote the 
sequence of predictions written on the blackboard. So, before the first 
student, it is 


BOARD: {} 



We start with the first student. If the marble selected is red, the prediction 
will be majority red; if blue, it will be majority blue. Assuming it was blue, 
on the board we have 


BOARD: {B} 


The second student can pick a blue or a red marble. If blue, he also
predicts majority blue because he knows that the previous student must
have picked blue. If red, he knows that he has picked red and the first
student has picked blue; the two observations cancel out, so he can randomly
predict majority red or majority blue. So, after the second student we
either have


BOARD: {B,B} or BOARD: {B,R} 


Assume we end up with BOARD: {B,B}. In this case, if the third student
takes out a red marble, the conditional probability is higher for majority
blue, although she observed a red marble. Hence, herd behavior takes place,
and on the board we will have BOARD: {B,B,B}. From this student onward,
independent of what is being observed, everyone will predict majority blue.
Let us demonstrate why this happens based on conditional probabilities and
our problem setting. In our problem, we know that the first student predicts
majority blue if $P(\text{majority blue} \mid \text{student's observation}) > 1/2$
and majority red otherwise. We also know from the experiment's setup that

$$P(\text{majority blue}) = P(\text{majority red}) = 1/2, \tag{7.1}$$

$$P(\text{blue} \mid \text{majority blue}) = P(\text{red} \mid \text{majority red}) = 2/3. \tag{7.2}$$









Let us assume that the first student observes blue; then,

\begin{align}
P(\text{majority blue} \mid \text{blue})
  &= \frac{P(\text{blue} \mid \text{majority blue})\, P(\text{majority blue})}{P(\text{blue})}, \tag{7.3} \\
P(\text{blue})
  &= P(\text{blue} \mid \text{majority blue})\, P(\text{majority blue})
   + P(\text{blue} \mid \text{majority red})\, P(\text{majority red}) \tag{7.4} \\
  &= 2/3 \times 1/2 + 1/3 \times 1/2 = 1/2. \tag{7.5}
\end{align}

Therefore, $P(\text{majority blue} \mid \text{blue}) = \frac{2/3 \times 1/2}{1/2} = 2/3$.
So, if the first student picks blue, she will predict majority blue, and if
she picks red, she will predict majority red. Assuming the first student
picks blue, the same argument holds for the second student; if blue is
picked, he will also predict majority blue. Now, in the case of the third
student, assuming she has picked red, and having BOARD: {B,B} on the
blackboard, then,

\begin{align}
P(\text{majority blue} \mid \text{blue, blue, red})
  &= \frac{P(\text{blue, blue, red} \mid \text{majority blue})}{P(\text{blue, blue, red})}
     \times P(\text{majority blue}), \tag{7.6} \\
P(\text{blue, blue, red} \mid \text{majority blue})
  &= 2/3 \times 2/3 \times 1/3 = 4/27, \tag{7.7} \\
P(\text{blue, blue, red})
  &= P(\text{blue, blue, red} \mid \text{majority blue}) \times P(\text{majority blue}) \notag \\
  &\quad + P(\text{blue, blue, red} \mid \text{majority red}) \times P(\text{majority red}) \tag{7.8} \\
  &= (2/3 \times 2/3 \times 1/3) \times 1/2 + (1/3 \times 1/3 \times 2/3) \times 1/2 = 1/9. \notag
\end{align}


Therefore, $P(\text{majority blue} \mid \text{blue, blue, red}) = \frac{4/27 \times 1/2}{1/9} = 2/3$.
So, the third student predicts majority blue even though she picks red. Any
student after the third student also predicts majority blue regardless of
what is being picked because the conditional probability remains above 1/2.
Note that the urn can in fact be majority red. For instance, when blue,
blue, red is picked, there is a $1 - 2/3 = 1/3$ chance that it is majority
red; however, due to herding, the prediction could become incorrect.
Figure 7.3 depicts the herding process. In the figure, rectangles represent
the board status, and edge values represent the observations. Dashed arrows
depict transitions between states that contain the same statistical
information that is available to the students.

Figure 7.3. Urn Experiment. Rectangles represent student predictions written on the
blackboard, and edge values represent what the students observe. Rectangles are filled
with the most likely majority, computed from conditional probabilities.
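The posterior computations above can be checked mechanically. The following is a minimal sketch, assuming each draw is encoded as the character 'B' or 'R'; the function name `posterior_majority_blue` and its parameters are illustrative and do not come from the text.

```python
from fractions import Fraction


def posterior_majority_blue(draws, prior_blue=Fraction(1, 2)):
    """Return P(majority blue | draws) for a string of 'B'/'R' observations."""
    p_blue_if_blue_urn = Fraction(2, 3)   # P(blue | majority blue), Equation 7.2
    p_blue_if_red_urn = Fraction(1, 3)    # P(blue | majority red)
    like_blue_urn = Fraction(1)           # P(draws | majority blue)
    like_red_urn = Fraction(1)            # P(draws | majority red)
    for d in draws:
        if d == "B":
            like_blue_urn *= p_blue_if_blue_urn
            like_red_urn *= p_blue_if_red_urn
        else:
            like_blue_urn *= 1 - p_blue_if_blue_urn
            like_red_urn *= 1 - p_blue_if_red_urn
    # Total probability of the observed draws (the denominator in Bayes' rule).
    evidence = like_blue_urn * prior_blue + like_red_urn * (1 - prior_blue)
    return like_blue_urn * prior_blue / evidence


print(posterior_majority_blue("B"))    # 2/3
print(posterior_majority_blue("BBR"))  # 2/3
print(posterior_majority_blue("BR"))   # 1/2
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with the values derived by hand.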


7.1.2 Intervention 

As herding converges to a consensus over time, it is interesting to ask how
one can intervene in this process. In general, intervention is possible by
providing individuals with private information that was not previously
available. Consider an urn experiment where individuals decide on majority
red over time. Either (1) sending a private message to individuals informing
them that the urn is majority blue or (2) writing the observations next to
the predictions on the board stops the herding and changes decisions.
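The effect of the second intervention can be illustrated with a small simulation. This is a sketch under stated assumptions, not the procedure used by Anderson and Holt: draws are encoded as 'B'/'R', a student whose evidence is tied follows his or her own draw (the text lets the student choose randomly), and the name `predictions` and the flag `reveal_observations` are hypothetical.

```python
def predictions(draws, reveal_observations=False):
    """Return the predictions written on the board for a string of private draws."""
    board = []
    lead = 0  # (# blue draws) - (# red draws) that everyone can know so far
    for d in draws:
        signal = 1 if d == "B" else -1
        if not reveal_observations and abs(lead) >= 2:
            # Herding: public evidence outweighs any single private draw,
            # so this prediction tells later students nothing new.
            board.append("B" if lead > 0 else "R")
            continue
        total = lead + signal
        if total == 0:
            board.append(d)  # tie: assumed to follow the student's own draw
        else:
            board.append("B" if total > 0 else "R")
        lead += signal  # the draw is either written down or can be inferred
    return "".join(board)


print(predictions("BBRRR"))                            # BBBBB
print(predictions("BBRRR", reveal_observations=True))  # BBBRR
```

With predictions only, the first two blue draws lock everyone into majority blue. When the observations themselves are written on the board, later red draws remain visible and the predictions eventually change.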

7.2 Information Cascades 

In social media, individuals commonly repost content posted by others 
in the network. This content is often received via immediate neighbors 
(friends). An information cascade occurs as information propagates through 
friends. 

Formally, an information cascade is defined as a piece of information or
decision being cascaded among a set of individuals, where (1) individuals
are connected by a network and (2) individuals are only observing decisions
of their immediate neighbors (friends). Therefore, cascade users have less
information available to them compared to herding users, where almost all
information about decisions is available.

There are many approaches to modeling information cascades. Next, we 
introduce a basic model that can help explain information cascades. 


7.2.1 Independent Cascade Model (ICM) 

In this section, we discuss the independent cascade model (ICM) [Kempe
et al., 2003], which can be used to model information cascades. Variants
of this model have been discussed in the literature. Here, we discuss the
one detailed by Kempe et al. [2003]. Interested readers can refer to the
bibliographic notes for further references. Underlying assumptions for this
model include the following:

• The network is represented using a directed graph. Nodes are actors 
and edges depict the communication channels between them. A node 
can only influence nodes that it is connected to. 

• Decisions are binary: nodes can be either active or inactive. An active
node is one that has decided to adopt the behavior, innovation,
or decision.

• A node, once activated, can activate its neighboring nodes. 

• Activation is a progressive process, where nodes change from inactive 
to active, but not vice versa. 1 

Considering nodes that are active as senders and nodes that are being
activated as receivers, in the independent cascade model (ICM) senders
activate receivers. Therefore, ICM is called a sender-centric model. In
this model, a node that becomes active at time t has, in the next time step
t + 1, one chance of activating each of its neighbors. Let v be a node that
becomes active at time t. Then, for any neighbor w, there is a probability
$p_{v,w}$ that node w gets activated at t + 1. Node v has a single chance of
activating its neighbor w, and that activation can only happen at t + 1.
We start with a set of active nodes and continue until no further
activation is possible. Algorithm 7.1 details the process of the ICM.

Algorithm 7.1 Independent Cascade Model (ICM)

Require: Diffusion graph G(V, E), set of initial activated nodes $A_0$,
         activation probabilities $p_{v,w}$
 1: return Final set of activated nodes $A_\infty$
 2: i = 0;
 3: while $A_i \neq \{\}$ do
 4:
 5:     i = i + 1;
 6:     $A_i = \{\}$;
 7:     for all $v \in A_{i-1}$ do
 8:         for all w neighbor of v, $w \notin \bigcup_{j=0}^{i} A_j$ do
 9:             rand = generate a random number in [0,1];
10:             if rand < $p_{v,w}$ then
11:                 activate w;
12:                 $A_i = A_i \cup \{w\}$;
13:             end if
14:         end for
15:     end for
16: end while
17: $A_\infty = \bigcup_{j=0}^{i} A_j$;
18: Return $A_\infty$;


Example 7.2. Consider the network in Figure 7.4 as an example. The
network is undirected; therefore, we assume $p_{v,w} = p_{w,v}$. Since it is
undirected, for any two vertices connected via an edge, there is an equal
chance of one activating the other. Consider the network in step 1. The
values on the edges denote the $p_{v,w}$'s. The ICM procedure starts with a
set of nodes activated. In our case, it is node $v_1$. Each activated node
gets one chance of activating its neighbors. The activated node generates a
random number for each neighbor. If the random number is less than the
respective $p_{v,w}$ of the neighbor (see Algorithm 7.1, lines 9-11), the
neighbor gets activated. The random numbers generated are shown in
Figure 7.4 in the form of inequalities, where the left-hand side is the
random number generated and the right-hand side is the $p_{v,w}$. As depicted,
by following the procedure, after five steps, five nodes get activated and
the ICM procedure converges.
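The ICM procedure can be sketched in a few lines of Python. This is a minimal sketch assuming the graph is given as an adjacency dictionary and the probabilities as a dictionary keyed by ordered node pairs; the function name `independent_cascade`, its parameters, and the small example network are illustrative and are not the network of Figure 7.4.

```python
import random


def independent_cascade(neighbors, prob, seeds, rng=None):
    """Run one ICM simulation and return the set of ultimately activated nodes.

    neighbors: dict mapping a node to an iterable of its out-neighbors
    prob:      dict mapping an ordered pair (v, w) to the probability p_{v,w}
    seeds:     initially activated nodes (A_0)
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(active)          # nodes activated in the previous step
    while frontier:
        newly_active = []            # nodes activated in the current step
        for v in frontier:
            for w in neighbors.get(v, ()):
                if w in active:
                    continue         # activation is progressive: skip active nodes
                if rng.random() < prob[(v, w)]:
                    active.add(w)
                    newly_active.append(w)
        frontier = newly_active      # each node gets its chance exactly once
    return active


# Hypothetical network with extreme probabilities so the outcome is predictable.
nbrs = {"a": ["b", "d"], "b": ["c"], "d": ["e"]}
p = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "d"): 0.0, ("d", "e"): 1.0}
print(sorted(independent_cascade(nbrs, p, ["a"])))  # ['a', 'b', 'c']
```

For an undirected network, each edge would appear in both directions in `neighbors` and `prob`, with equal probabilities.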

Clearly, the ICM characterizes an information diffusion process. It 
is sender-centered, and once a node is activated, it aims to activate all its 
neighboring nodes. Node activation in ICM is a probabilistic process. Thus, 
we might get different results for different runs.

Figure 7.4. Independent Cascade Model (ICM) Simulation. The numbers on the edges
represent the weights $p_{v,w}$. When there is an inequality, the activation condition is
checked. The left number denotes the random number generated, and the right number
denotes the weight $p_{v,w}$.


One interesting question when dealing with the ICM is how, given a network,
to activate a small set of nodes initially such that the final number of
activated nodes in the network is maximized. We discuss this next.

7.2.2 Maximizing the Spread of Cascades 

Consider a network of users and a company that is marketing a product. 
The company is trying to advertise its product in the network. The company 
has a limited budget; therefore, not all users can be targeted. However, 
when users find the product interesting, they can talk with their friends 
(immediate neighbors) and market the product. Their neighbors, in turn, 
will talk about it with their neighbors, and as this process progresses, the 
news about the product is spread to a population of nodes in the network. 
The company plans on selecting a set of initial users such that the size of 
the final population talking about the product is maximized.

Formally, let S denote a set of initially activated nodes (seed set) in 
ICM. Let f(S) denote the number of nodes that get ultimately activated 
in the network if nodes in S are initially activated. For our ICM example 
depicted in Figure 7.4, |S| = 1 and f(S) = 5. Given a budget k, our goal is 
to find a set S such that its size is equal to our budget |S| = k and f(S) is 
maximized. 

Since the activations in ICM depend on the random number generated
for each neighbor (see line 9, Algorithm 7.1), it is challenging to
determine the number of nodes that ultimately get activated, f(S), for a
given set S. In other words, the number of ultimately activated individuals
can differ depending on the random numbers generated. ICM can be made
deterministic (nonrandom) by generating these random numbers at the
beginning of the ICM process for the whole network. In other words, we can
generate a random number $r_{v,w}$ for any connected pair of nodes. Then,
whenever node v has a chance of activating w, instead of generating
a random number, it can compare $r_{v,w}$ with $p_{v,w}$. Following this
approach, ICM becomes deterministic, and given any set of initially
activated nodes S, we can compute the number of ultimately activated nodes
f(S).
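One way to sketch this deterministic variant is to draw every $r_{v,w}$ once, keep only the edges on which the comparison succeeds, and count the nodes reachable from S over those edges. The code below assumes one random number per ordered pair (v, w); the names `sample_live_edges` and `spread` are illustrative.

```python
import random


def sample_live_edges(prob, rng):
    """Draw r_{v,w} once per edge and keep the edges with r_{v,w} < p_{v,w}."""
    live = {}
    for (v, w), p in prob.items():
        if rng.random() < p:
            live.setdefault(v, []).append(w)
    return live


def spread(live, seeds):
    """f(S): number of nodes reachable from the seed set over the kept edges."""
    reached = set(seeds)
    stack = list(reached)
    while stack:
        v = stack.pop()
        for w in live.get(v, ()):
            if w not in reached:
                reached.add(w)
                stack.append(w)
    return len(reached)


# Hypothetical probabilities; the same `live` dictionary is reused for every S,
# so f(S) is a fixed number for each seed set.
prob = {(1, 2): 1.0, (2, 3): 1.0, (3, 4): 0.0}
live = sample_live_edges(prob, random.Random(0))
print(spread(live, {1}), spread(live, {4}))  # 3 1
```

Because the random numbers are fixed up front, repeated calls to `spread` with the same seed set always return the same value.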

Before finding S, we detail properties of f(S). The function f(S) is
non-negative because for any set of nodes S, in the worst case, no node gets
activated. It is also monotone:

$$f(S \cup \{v\}) \geq f(S). \tag{7.9}$$

This is because when a node is added to the set of initially activated nodes,
it either increases the number of ultimately activated nodes or keeps them
the same. Finally, f(S) is submodular. A set function f is submodular if
for any finite set N,

$$\forall S \subseteq T \subseteq N,\ \forall v \in N \setminus T,\quad
f(S \cup \{v\}) - f(S) \geq f(T \cup \{v\}) - f(T). \tag{7.10}$$

The proof that function f is submodular is beyond the scope of this book,
but interested readers are referred to [Kempe et al., 2003] for the proof.
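For small ground sets, Equations 7.9 and 7.10 can be checked by brute force. The sketch below does so for a coverage-style function, where each seed reaches a fixed set of nodes; the `reach` dictionary and the function names are hypothetical and serve only as an illustration, not as a proof for ICM.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_monotone_submodular(f, ground):
    """Check Equations 7.9 and 7.10 for every S subset of T and every v outside T."""
    ground = frozenset(ground)
    for T in subsets(ground):
        for S in subsets(T):
            for v in ground - T:
                gain_S = f(S | {v}) - f(S)
                gain_T = f(T | {v}) - f(T)
                if gain_S < 0 or gain_S < gain_T:
                    return False
    return True


# Hypothetical example: node s, once seeded, ultimately activates reach[s].
reach = {1: {1, 2}, 2: {2, 3}, 3: {3}, 4: {4, 2}}
coverage = lambda S: len(set().union(*(reach[s] for s in S)))

print(is_monotone_submodular(coverage, reach))               # True
print(is_monotone_submodular(lambda S: len(S) ** 2, reach))  # False
```

The second call uses a function with increasing marginal gains, which violates Equation 7.10.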
So, f is non-negative, monotone, and submodular. Unfortunately, for a
submodular non-negative monotone function f, finding a k-element set S
such that f(S) is maximized is an NP-hard problem [Kempe et al., 2003].
In other words, we know no efficient algorithm for finding this set. Often,
when a computationally challenging problem is at hand, approximation
algorithms come in handy. In particular, the following theorem helps us
approximate S.

Theorem 7.1. Kempe et al. [2003] Let f be a (1) non-negative, (2) monotone,
and (3) submodular set function. Construct a k-element set S, each time
by adding the node v such that $f(S \cup \{v\})$ (or equivalently,
$f(S \cup \{v\}) - f(S)$) is maximized. Let $S^{\text{Optimal}}$ be the k-element
set such that f is maximized. Then $f(S) \geq (1 - \frac{1}{e}) f(S^{\text{Optimal}})$.

This theorem states that by constructing the set S greedily one can get at
least a $(1 - 1/e) \approx 63\%$ approximation of the optimal value.
Algorithm 7.2 details this greedy approach. The algorithm starts with an
empty set S and adds node $v_1$, which ultimately activates most other nodes
if activated. Formally, $v_1$ is selected such that $f(\{v_1\})$ is the maximum.
The algorithm then selects the second node $v_2$ such that $f(\{v_1, v_2\})$ is
maximized. The process is continued until the kth node $v_k$ is selected.
Following this algorithm, we find an approximate solution, with the above
guarantee, for the problem of cascade maximization.

Algorithm 7.2 Maximizing the spread of cascades - Greedy algorithm

Require: Diffusion graph G(V, E), budget k
 1: return Seed set S (set of initially activated nodes)
 2: i = 0;
 3: S = {};
 4: while $i \neq k$ do
 5:     $v = \arg\max_{v \in V \setminus S} f(S \cup \{v\})$;
        or equivalently $\arg\max_{v \in V \setminus S} f(S \cup \{v\}) - f(S)$
 6:     $S = S \cup \{v\}$;
 7:     i = i + 1;
 8: end while
 9: Return S;
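The greedy procedure can be sketched as follows for a deterministic setting in which f(S) is the number of nodes reachable from S. The network below is hypothetical, and the names `spread` and `greedy_seeds` are illustrative; ties are broken by taking the first maximizer in the order of `nodes`, which is an arbitrary choice.

```python
def spread(live, seeds):
    """f(S): number of nodes reachable from the seeds over the given edges."""
    reached = set(seeds)
    stack = list(reached)
    while stack:
        v = stack.pop()
        for w in live.get(v, ()):
            if w not in reached:
                reached.add(w)
                stack.append(w)
    return len(reached)


def greedy_seeds(nodes, f, k):
    """Pick k seeds, each time adding the node v that maximizes f(S U {v})."""
    seeds = set()
    for _ in range(k):
        candidates = [v for v in nodes if v not in seeds]
        if not candidates:
            break
        best = max(candidates, key=lambda v: f(seeds | {v}))
        seeds.add(best)
    return seeds


# Hypothetical deterministic network: an edge v -> w means v activates w.
live = {1: [2, 3], 2: [3], 4: [5], 6: [4]}
nodes = [1, 2, 3, 4, 5, 6]
f = lambda S: spread(live, S)

chosen = greedy_seeds(nodes, f, k=2)
print(sorted(chosen), f(chosen))  # [1, 6] 6
```

Each round evaluates f once per remaining candidate, so the sketch performs on the order of k|V| evaluations of f.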

Example 7.3. For the following graph, assume that node i activates node
j when $|i - j| \equiv 2 \pmod 3$. Solve cascade maximization for k = 2.

To find the first node v, we compute f({v}) for all v. We start with node 1.
At time 0, node 1 can only activate node 6, because

$$|1 - 6| \equiv 2 \pmod 3, \tag{7.11}$$

$$|1 - 5| \not\equiv 2 \pmod 3. \tag{7.12}$$

At time 1, node 1 can no longer activate others, but node 6 is active and
can activate others. Node 6 has outgoing edges to nodes 4 and 5. From 4
and 5, node 6 can only activate 4:

$$|6 - 4| \equiv 2 \pmod 3, \tag{7.13}$$

$$|6 - 5| \not\equiv 2 \pmod 3. \tag{7.14}$$

At time 2, node 4 is activated. It has a single out-link to node 2,
and since $|4 - 2| \equiv 2 \pmod 3$, 2 is activated. Node 2 cannot activate
other nodes; therefore, f({1}) = 4. Similarly, we find that f({2}) = 1,
f({3}) = 1, f({4}) = 2, f({5}) = 1, and f({6}) = 4. So, 1 or 6 can be
chosen for our first node. Let us choose 6. If 6 is initially activated,
nodes 1, 2, 4, and 6 will become activated at the end. Now, from the
set {1, 2, 3, 4, 5, 6} \ {1, 2, 4, 6} = {3, 5}, we need to select one more node.
This is because in the setting for this example, f({6, 1}) = f({6, 2}) =
f({6, 4}) = f({6}) = 4. In general, one needs to compute $f(S \cup \{v\})$ for all
$v \in V \setminus S$ (see Algorithm 7.2, line 5). We have f({6, 3}) = f({6, 5}) = 5,
so we can select one node randomly. We choose 3. So, S = {6, 3} and
f(S) = 5.


7.2.3 Intervention 

Consider a false rumor spreading in social media. This is an example 
where we are interested in stopping an information cascade in social media. 
Intervention in the independent cascade model can be achieved using three 
methods: 

1. By limiting the number of out-links of the sender node and potentially 
reducing the chance of activating others. Note that when the sender 
node is not connected to others via directed edges, no one will get 
activated by the sender. 

2. By limiting the number of in-links of receiver nodes and therefore 
reducing their chance of getting activated by others. 

3. By decreasing the activation probability of a node ($p_{v,w}$) and
therefore reducing the chance of activating others.

7.3 Diffusion of Innovations

Diffusion of innovations is a phenomenon observed regularly in social 
media. A music video going viral or a piece of news being retweeted many 
times are examples of innovations diffusing across social networks. As 
defined by Rogers [2003], an innovation is “an idea, practice, or object 
that is perceived as new by an individual or other unit of adoption.” 
Innovations are created regularly; however, not all innovations spread 
through populations. The theory of diffusion of innovations aims to answer 
why and how these innovations spread. It also describes the reasons behind 
the diffusion process, the individuals involved, and the rate at which ideas 
spread. In this section, we review characteristics of innovations that are 
likely to be diffused through populations and detail well-known models in 
the diffusion of innovations. Finally, we provide mathematical models that 
can model the process of diffusion of innovations and describe how we can 
intervene with these models. 

7.3.1 Innovation Characteristics 

For an innovation to be adopted, the individual adopting it (adopter) and the 
innovation must have certain qualities. 

Innovations must be highly observable, should have a relative advantage
over current practices, should be compatible with the sociocultural paradigm
to which they are being presented, should be observable under various trials
(trialability), and should not be highly complex.

In terms of individual characteristics, many researchers [Rogers, 2003;
Hirschman, 1980] claim that the adopter should adopt the innovation earlier
than other members of his or her social circle (innovativeness).

7.3.2 Diffusion of Innovations Models 

Some of the earliest models for diffusion of innovations were provided by
Gabriel Tarde in the early 20th century [Tarde, 1907]. In this section, we
review basic diffusion of innovations models. Interested readers may refer
to the bibliographical notes for further study.


Ryan and Gross: Adopter Categories 

Ryan and Gross [1943] studied the adoption of hybrid seed corn by farmers
in Iowa [Strang and Soule, 1998]. The hybrid seed corn was highly resistant
to diseases and other catastrophes such as droughts. However, farmers
did not adopt it at first because of its high price and the seed's inability
to reproduce. Their study showed that farmers received information through
two main channels: mass communications from companies selling the seeds and
interpersonal communications with other farmers. They found that although
farmers received information from the mass channel, the influence on their
behavior was coming from the interpersonal channel. They argued that
adoption depended on a combination of information from both channels.
They also observed that the adoption rate follows an S-shaped curve and
that there are five different types of adopters based on the order in which
they adopt the innovations: (1) Innovators (top 2.5%), (2) Early Adopters
(13.5%), (3) Early Majority (34%), (4) Late Majority (34%), and (5) Laggards
(16%). Figure 7.5 depicts the distribution of these adopters as well as
the cumulative adoption S-shaped curve. As shown in the figure, the adoption
rate is slow when innovators or early adopters adopt the product. Once
early majority individuals start adopting, the adoption curve becomes linear,
and the rate is constant until all late majority members adopt the product.
After the late majority adopts the product, the adoption rate becomes slow
once again as laggards start adopting, and the curve slowly approaches
100%.

Figure 7.5. Types of Adopters and S-Shaped Cumulative Adoption Curve.
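The cumulative curve can be related to the five adopter categories by accumulating their shares. The following small sketch uses only the percentages quoted above; the variable names are illustrative.

```python
from itertools import accumulate

# Share of each adopter category, in the order in which they adopt.
shares = [
    ("Innovators", 2.5),
    ("Early Adopters", 13.5),
    ("Early Majority", 34.0),
    ("Late Majority", 34.0),
    ("Laggards", 16.0),
]

# Cumulative adoption (in percent) once each category has fully adopted.
cumulative = list(accumulate(share for _, share in shares))

for (name, _), total in zip(shares, cumulative):
    print(f"{name}: {total}%")
# cumulative == [2.5, 16.0, 50.0, 84.0, 100.0]
```

The slow start (16% after the first two categories), the steep middle (from 16% to 84%), and the slow finish correspond to the three parts of the S-shaped curve.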


Katz: Two-Step Flow Model 

Elihu Katz, a professor of communication at the University of Pennsylvania,
is a well-known figure in the study of the flow of information. In
addition to conducting a study, similar to the hybrid seed corn study, on
how physicians adopted the new drug tetracycline [Coleman et al., 1966],
Katz also developed a two-step flow model (also known as the multistep flow
model) [Katz and Lazarsfeld, 2005] that describes how information is
delivered through mass communication. The basic idea is depicted in
Figure 7.6. Most information comes from mass media and is then directed
toward influential figures called opinion leaders. These leaders then convey
the information (or form opinions) and act as hubs for other members of the
society.

Figure 7.6. Katz Two-Step Flow Model.


Rogers: Diffusion of Innovations Process

Rogers, in his well-known book Diffusion of Innovations [Rogers, 2003],
discusses various theories regarding the diffusion of innovations process.
In particular, he describes a five-stage process of adoption:

1. Awareness: In this stage, the individual becomes aware of the
innovation, but her information about the product is limited.

2. Interest: The individual shows interest in the product and seeks more 
information. 

3. Evaluation: The individual tries the product in his mind and decides 
whether or not to adopt it. 

4. Trial: The individual performs a trial use of the product. 

5. Adoption: The individual decides to continue the trial and adopts 
the product for full use. 


7.3.3 Modeling Diffusion of Innovations 


To make effective use of the theories regarding the diffusion of
innovations, we demonstrate a mathematical model for it in this section.
The model incorporates the basic elements discussed so far and can be used
to effectively model a diffusion of innovations process. It can be
concretely described as

$$\frac{dA(t)}{dt} = i(t)[P - A(t)]. \tag{7.15}$$

Here, A(t) denotes the total population that adopted the innovation until
time t, i(t) denotes the coefficient of diffusion, which describes the
innovativeness of the product being adopted, and P denotes the total number
of potential adopters (until time t). This equation shows that the rate at
which the number of adopters changes over time depends on how innovative
the product being adopted is. The adoption rate only affects the
potential adopters who have not yet adopted the product. Since A(t) is the
total population of adopters until time t, it is a cumulative sum and can be
computed as follows:

$$A(t) = \int_{t_0}^{t} a(t)\, dt, \tag{7.16}$$
where a(t) defines the adopters at time t. Let $A_0$ denote the number of
adopters at time $t_0$. There are various methods of defining the diffusion
coefficient [Mahajan, 1985]. One way is to define i(t) as a linear
combination of the cumulative number of adopters at different times A(t),

$$i(t) = \alpha + \alpha_0 A_0 + \cdots + \alpha_t A(t) = \alpha + \sum_{j=t_0}^{t} \alpha_j A(j), \tag{7.17}$$

where the $\alpha_j$'s are the weights for each time step. Often a simplified
version of this linear combination is used. In particular, the following
three models for computing i(t) are considered in the literature:

$$i(t) = \alpha, \qquad \text{External-Influence Model} \tag{7.18}$$

$$i(t) = \beta A(t), \qquad \text{Internal-Influence Model} \tag{7.19}$$

$$i(t) = \alpha + \beta A(t), \qquad \text{Mixed-Influence Model} \tag{7.20}$$

where $\alpha$ is the external-influence factor and $\beta$ is the imitation factor.
Equation 7.18 describes i(t) in terms of $\alpha$ only and is independent of
the current number of adopters A(t); therefore, in this model, the adoption
only depends on the external influence. In the second model, i(t) depends
on the number of adopters at any time and is therefore dependent on the
internal factors of the diffusion process. $\beta$ defines how much the current
adopter population is going to affect the adoption and is therefore called
the imitation factor. The mixed-influence model is a model between the
two that uses a linear combination of both previous models.


External-Influence Model 


In the external-influence model, the adoption coefficient only depends on
an external factor. One such example of external influence in social media
is when important news goes viral. Often, people who post or read the news
do not know each other; therefore, the importance of the news determines
whether it goes viral. The external-influence model can be formulated as

$$\frac{dA(t)}{dt} = \alpha[P - A(t)]. \tag{7.21}$$

By solving Equation 7.21,

$$A(t) = P(1 - e^{-\alpha t}), \tag{7.22}$$

when $A(t = t_0 = 0) = 0$. The A(t) function is shown in Figure 7.7. The
number of adopters grows quickly at first and then saturates near P, with
the remaining gap P - A(t) decaying exponentially.
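As a sketch of where Equation 7.22 comes from, Equation 7.21 can be solved by separation of variables. The integration constant C is introduced here only for this derivation.

```latex
\begin{align*}
\frac{dA(t)}{dt} = \alpha[P - A(t)]
  &\;\Longrightarrow\; \frac{dA}{P - A} = \alpha\, dt \\
  &\;\Longrightarrow\; -\ln\bigl(P - A(t)\bigr) = \alpha t + C \\
  &\;\Longrightarrow\; P - A(t) = e^{-C} e^{-\alpha t}.
\end{align*}
% The initial condition A(0) = 0 gives e^{-C} = P, hence
\[
A(t) = P\left(1 - e^{-\alpha t}\right).
\]
```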



Figure 7.7. External-Influence Model for P = 100 and $\alpha = 0.01$.


Internal-Influence Model 


In the internal-influence model, adoption depends on how many have
adopted the innovation in the current time step. 1 In social media there
is internal influence when a group of friends join a site due to peer pressure.
Think of a group of individuals where the likelihood of joining a social
networking site increases as more group members join the site. The
internal-influence model can be described as follows:

$$\frac{dA(t)}{dt} = \beta A(t)[P - A(t)]. \tag{7.23}$$

Since the diffusion rate in this model depends on $\beta A(t)$, it is called
the pure imitation model. The solution to this model is defined as

$$A(t) = \frac{P}{1 + \frac{P - A_0}{A_0}\, e^{-\beta P (t - t_0)}}, \tag{7.24}$$


where $A(t = t_0) = A_0$. The A(t) function is shown in Figure 7.8.

Figure 7.8. Internal-Influence Model for $A_0 = 30$, $\beta = 10^{-5}$, and P = 200.

Mixed-Influence Model

As discussed, the mixed-influence model is situated in between the internal-
and external-influence models. The mixed-influence model is defined as

$$\frac{dA(t)}{dt} = (\alpha + \beta A(t))[P - A(t)]. \tag{7.25}$$

By solving the differential equation, we arrive at

$$A(t) = \frac{P - \frac{\alpha (P - A_0)}{\alpha + \beta A_0}\, e^{-(\alpha + \beta P)(t - t_0)}}
{1 + \frac{\beta (P - A_0)}{\alpha + \beta A_0}\, e^{-(\alpha + \beta P)(t - t_0)}}, \tag{7.26}$$

where $A(t = t_0) = A_0$. The A(t) function for the mixed-influence model is
depicted in Figure 7.9.

Figure 7.9. Mixed-Influence Model for P = 200, $\beta = 10^{-5}$, $A_0 = 30$, and $\alpha = 10^{-3}$.

We discussed three models in this section: internal, external, and mixed
influence. Depending on the model used to describe the diffusion of
innovations process, the respective equation for A(t) (Equations 7.22, 7.24,
or 7.26) should be employed to model the system.
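The three closed-form solutions can be compared numerically. The sketch below writes them as the standard solutions of Equations 7.21, 7.23, and 7.25 and cross-checks the mixed-influence form against a simple forward Euler integration of the differential equation. The function names, the step count, and the parameter values in the usage lines are assumptions chosen for illustration.

```python
import math


def external(t, P, alpha):
    """External-influence solution with A(0) = 0."""
    return P * (1 - math.exp(-alpha * t))


def internal(t, P, beta, A0, t0=0.0):
    """Internal-influence (pure imitation) solution with A(t0) = A0."""
    return P / (1 + (P - A0) / A0 * math.exp(-beta * P * (t - t0)))


def mixed(t, P, alpha, beta, A0, t0=0.0):
    """Mixed-influence solution with A(t0) = A0."""
    decay = math.exp(-(alpha + beta * P) * (t - t0))
    num = P - alpha * (P - A0) / (alpha + beta * A0) * decay
    den = 1 + beta * (P - A0) / (alpha + beta * A0) * decay
    return num / den


def euler(P, alpha, beta, A0, t_end, steps=100_000):
    """Integrate dA/dt = (alpha + beta*A)(P - A) from 0 to t_end as a cross-check."""
    A, dt = A0, t_end / steps
    for _ in range(steps):
        A += (alpha + beta * A) * (P - A) * dt
    return A


# Illustrative parameters: P = 200, alpha = 1e-3, beta = 1e-5, A0 = 30.
closed_form = mixed(300, 200, 1e-3, 1e-5, 30)
numeric = euler(200, 1e-3, 1e-5, 30, 300)
print(abs(closed_form - numeric) < 1e-2)  # True
```

Setting beta = 0 with A0 = 0 in `mixed` recovers `external`, and setting alpha = 0 recovers `internal`, which is a quick consistency check on the three formulas.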

7.3.4 Intervention 

Consider a faulty product being adopted. The product company is planning 
to stop or delay adoptions until the product is fixed and re-released. This
intervention can be performed by doing the following:

• Limiting the distribution of the product or the audience that can adopt 
the product. In our mathematical model, this is equivalent to reducing 
the population P that can potentially adopt the product. 

• Reducing interest in the product being sold. For instance, the company
can inform adopters of the faulty status of the product. In our models,
this can be achieved by adjusting $\alpha$: setting $\alpha$ to a very small value
in Equation 7.22 results in a slow adoption rate.

• Reducing interactions within the population. Reduced interactions
result in less imitation of product adoptions and a general decrease in
the trend of adoptions. In our models, this can be achieved by setting
$\beta$ to a small value.
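To make the effect of reducing the external-influence factor concrete, consider the external-influence solution $A(t) = P(1 - e^{-\alpha t})$ and ask when half of the potential adopters have adopted. The symbol $t_{1/2}$ is introduced here only for illustration.

```latex
\[
P\left(1 - e^{-\alpha t_{1/2}}\right) = \frac{P}{2}
\;\Longrightarrow\;
e^{-\alpha t_{1/2}} = \frac{1}{2}
\;\Longrightarrow\;
t_{1/2} = \frac{\ln 2}{\alpha}.
\]
% Example: \alpha = 0.01 gives t_{1/2} = \ln 2 / 0.01 \approx 69.3;
% halving \alpha to 0.005 doubles it to about 138.6.
```

Under this model, reducing P lowers the saturation level without changing $t_{1/2}$, whereas reducing $\alpha$ delays adoption without changing the saturation level.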


7.4 Epidemics 

In an epidemic, a disease spreads widely within a population. This process
consists of a pathogen (the disease being spread), a population of hosts 
(humans, animals, and plants, among others), and a spreading mechanism 
(breathing, drinking, sexual activity, etc.). Unlike information cascades and 
herding, but similar to diffusion of innovations models, epidemic models 
assume an implicit network and unknown connections among individuals. 
This makes epidemic models more suitable when we are interested in global 
patterns, such as trends and ratios of people getting infected, and not in who 
infects whom. 

In general, a complete understanding of the epidemic process requires
substantial knowledge of the biological process within each host and the
immune system process, as well as a comprehensive analysis of interactions
among individuals. Other factors such as social and cultural attributes
also play a role in how, when, and where epidemics happen. Large epidemics,
also known as pandemics, have spread through human populations
and include the Black Death in the 14th century (killing, by some estimates,
more than 50% of Europe's population), the Great Plague of London (100,000
deaths), the smallpox epidemic in the 17th century (killing more than 90% of
Massachusetts Bay Native Americans), and recent pandemics such as
HIV/AIDS, SARS, H5N1 (Avian flu), and influenza. These motivated the
introduction of epidemic models in the early 20th century and the
establishment of the epidemiology field.

There are various ways of modeling epidemics. For instance, one can 
look at how hosts contact each other and devise methods that describe how 
epidemics happen in networks. These networks are called contact networks. 
A contact network is a graph where nodes represent the hosts and edges 
represent the interactions between these hosts. For instance, in the case of 
the HIV/AIDS epidemic, edges represent sexual interactions, and in the 
case of influenza, nodes that are connected represent hosts that breathe 
the same air. Nodes that are close in a contact network are not necessarily 
close in terms of real-world proximity. The two notions of closeness might 
coincide for plants or animals, but diseases such as SARS or avian flu travel 
between continents because of the traveling patterns of hosts. This spreading 
pattern becomes clearer when the science of epidemics is employed to 
understand the propagation of computer viruses in cell phone networks or 
across the internet [Pastor-Satorras and Vespignani, 2001; Newman et al., 
2002]. 

Another way of looking at epidemic models is to avoid considering 
network information and to analyze only the rates at which hosts get 
infected, recover, and the like. This analysis is known as the fully mixed 
technique, assuming that each host has an equal chance of meeting other 
hosts. Through these interactions, hosts have random probabilities of getting 
infected. Though simplistic, the technique reveals several useful methods of 
modeling epidemics that are often capable of describing various real-world 
outbreaks. In this section, we concentrate on the fully mixed models that 
avoid the use of contact networks. 

Note that the models of information diffusion that we have already 
discussed, such as the models in diffusion of innovations or information 
cascades, are more or less related to epidemic models. However, what 
makes epidemic models different is that, in the other models of information 
diffusion, actors decide whether to adopt the innovation or take the decision 
and the system is usually fully observable. In epidemics, however, the 
system has a high level of uncertainty, and individuals usually do not 
decide whether to get infected or not. The models discussed in this section 
assume that (1) no contact network information is available and (2) the 
process by which hosts get infected is unknown. These models can be 
applied to situations in social media where the decision process has a 
certain uncertainty to it or is ambiguous to the analyst. 


7.4.1 Definitions 

Since there is no network, we assume that we have a population where the 
disease is being spread. Let N define the size of this crowd. Any member 
of the crowd can be in one of three states: 

1. Susceptible: When an individual is in the susceptible state, he or she 
can potentially get infected by the disease. In reality, infections can 
come from outside the population where the disease is being spread 
(e.g., by genetic mutation, contact with an animal, etc.); however, for 
simplicity, we make a closed-world assumption, where susceptible 
individuals can only get infected by infected people in the population. 
We denote the number of susceptibles at time t as S(t) and the fraction 
of the population that is susceptible as s(t) = S(t)/N. 

2. Infected: An infected individual has the chance of infecting susceptible 
parties. Let I(t) denote the number of infected individuals at 
time t, and let i(t) denote the fraction of individuals who are infected, 
i(t) = I(t)/N. 

3. Recovered (or Removed): These are individuals who have either 
recovered from the disease and hence have complete or partial immu¬ 
nity against the infection or were killed by the infection. Let R(t) 
denote the size of this set at time t and r(t) the fraction recovered, 
r(t) = R(t)/N . 


Clearly, N = S(t) + I(t) + R(t) for all t. Since we are assuming that 
there is some level of randomness associated with the values of S(t), I(t), 
and R(t), we try to deal with expected values and assume S, I, and R 
represent these at time t. 


7.4.2 SI Model 

We start with the most basic model. In this model, the susceptible individuals 
get infected, and once infected, they will never get cured. Denote $\beta$ as 
the contact probability. In other words, the probability of a pair of people 
meeting in any time step is $\beta$. So, if $\beta = 1$, everyone comes into 
contact with everyone else, and if $\beta = 0$, no one meets another individual. 
Assume that when an infected individual meets a susceptible individual the 
disease is being spread with probability 1 (this can be generalized to other 
values). Figure 7.10 demonstrates the SI model and the transition between 
states that happens in this model for individuals. The value over the arrow 
shows that each susceptible individual meets at least $\beta I$ infected 
individuals during the next time step. 

Figure 7.10. SI Model. 

Given this situation, each infected individual will meet $\beta N$ people on 
average. We know from this set that only the fraction S/N will be susceptible 
and that the rest are infected already. So, each infected individual will 
infect $\beta N S/N = \beta S$ others. Since I individuals are infected, 
$\beta I S$ will be infected in the next time step. This means that the number 
of susceptible individuals will be reduced by this quantity as well. So, to 
get different values of S and I at different times, we can solve the following 
differential equations: 

$$\frac{dS}{dt} = -\beta I S, \tag{7.27}$$

$$\frac{dI}{dt} = \beta I S. \tag{7.28}$$

Since S + I = N at all times, we can eliminate one equation by replacing 
S with $N - I$: 

$$\frac{dI}{dt} = \beta I (N - I). \tag{7.29}$$


The solution to this differential equation is called the logistic growth 
function, 

$$I(t) = \frac{N I_0 e^{\beta t N}}{N + I_0 (e^{\beta t N} - 1)}, \tag{7.30}$$

where $I_0$ is the number of individuals infected at time 0. In general, 
analyzing epidemics in terms of the number of infected individuals has limited 
generalization power. To address this limitation, we can consider infected 
fractions. We therefore substitute $i_0 = I_0/N$ in the previous equation, 

$$i(t) = \frac{i_0 e^{\beta t N}}{1 + i_0 (e^{\beta t N} - 1)}. \tag{7.31}$$

Figure 7.11. SI model simulation compared to the HIV/AIDS growth in the United 
States: (a) SI Model Simulation; (b) HIV/AIDS Infected Population Growth (new 
cases of AIDS in the United States). 


Note that in the limit, the SI model infects all the susceptible population 
because there is no recovery in the model. Figure 7.11(a) depicts the logistic 
growth function (infected individuals) and susceptible individuals for N = 
100, $I_0 = 1$, and $\beta = 0.003$. Figure 7.11(b) depicts the infected 
population for HIV/AIDS for the past 20 years. As observed, the infected 
population can be approximated well with the logistic growth function and 
follows the SI model. Note that in the HIV/AIDS graph, not everyone is getting 
infected. This is because not everyone in the United States is in the 
susceptible population, so not everyone will get infected in the end. Moreover, 
there are other factors that are far more complex than the details of the SI 
model that determine how people get infected with HIV/AIDS. 
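
The SI dynamics can be checked numerically. The following is a minimal sketch, not the authors' code: it integrates Equation 7.29 with a simple forward-Euler scheme and compares the result with the logistic growth function of Equation 7.30, using the parameter values quoted for Figure 7.11(a). The function names and the step size `dt` are assumptions made for illustration.

```python
import math


def si_closed_form(t, n, i0, beta):
    """Logistic growth solution of the SI model (Equation 7.30)."""
    growth = math.exp(beta * t * n)
    return n * i0 * growth / (n + i0 * (growth - 1.0))


def si_euler(n, i0, beta, t_end, dt=0.01):
    """Integrate dI/dt = beta * I * (N - I) (Equation 7.29) with forward Euler.

    The step size dt is an illustrative choice, not a value from the text.
    """
    infected = float(i0)
    for _ in range(int(round(t_end / dt))):
        infected += dt * beta * infected * (n - infected)
    return infected


if __name__ == "__main__":
    N, I0, BETA = 100, 1, 0.003  # values quoted for Figure 7.11(a)
    for t in (0, 10, 20, 30, 40):
        exact = si_closed_form(t, N, I0, BETA)
        approx = si_euler(N, I0, BETA, t)
        # The number of susceptibles follows from S = N - I.
        print(f"t={t:2d}  I closed form={exact:6.2f}  I Euler={approx:6.2f}  "
              f"S={N - exact:6.2f}")
```

Because there is no recovery, both curves approach N as t grows, which is the saturation behavior described above.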

7.4.3 SIR Model 

The SIR model, first introduced by Kermack and McKendrick [1932], adds 
more detail to the standard SI model. In the SIR model, in addition to the I 
and S states, a recovery state R is present. Figure 7.12 depicts the model. 
In the SIR model, hosts get infected, remain infected for a while, and then 
recover. Once hosts recover (or are removed), they can no longer get infected 
and are no longer susceptible. The process by which susceptible individuals 
get infected is similar to the SI model, where a parameter $\beta$ defines the 
probability of contacting others. Similarly, a parameter $\gamma$ in the SIR 
model defines how infected people recover, or the recovering probability of an 
infected individual in a time period $\Delta t$. 

Figure 7.12. SIR Model. 

In terms of differential equations, the SIR model is 

$$\frac{dS}{dt} = -\beta I S, \tag{7.32}$$

$$\frac{dI}{dt} = \beta I S - \gamma I, \tag{7.33}$$

$$\frac{dR}{dt} = \gamma I. \tag{7.34}$$

Equation 7.32 is identical to that of the SI model (Equation 7.27). 
Equation 7.33 differs from Equation 7.28 of the SI model by the additional 
term $-\gamma I$, where $\gamma I$ is the number of infected individuals who 
recovered. These are removed from the infected set and are added to the 
recovered ones in Equation 7.34. Dividing Equation 7.32 by Equation 7.34, we get 

$$\frac{dS}{dR} = -\frac{\beta}{\gamma} S, \tag{7.35}$$

and by assuming the number of recovered at time 0 is zero ($R_0 = 0$), 

$$\log \frac{S_0}{S} = \frac{\beta}{\gamma} R, \tag{7.36}$$

$$S_0 = S e^{\frac{\beta}{\gamma} R}, \tag{7.37}$$

$$S = S_0 e^{-\frac{\beta}{\gamma} R}. \tag{7.38}$$

Since I + S + R = N, we replace I in Equation 7.34, 

$$\frac{dR}{dt} = \gamma (N - S - R). \tag{7.39}$$

Now combining Equations 7.38 and 7.39, 

$$\frac{dR}{dt} = \gamma \left(N - S_0 e^{-\frac{\beta}{\gamma} R} - R\right). \tag{7.40}$$

If we solve this equation for R, then we can determine S from Equation 7.38 
and I from $I = N - R - S$. The solution for R can be computed by solving 
the following integration, obtained by separating the variables in 
Equation 7.40: 

$$t = \frac{1}{\gamma} \int_0^R \frac{dx}{N - S_0 e^{-\frac{\beta}{\gamma} x} - x}. \tag{7.41}$$

Figure 7.13. SIR Model Simulated with $S_0 = 99$, $I_0 = 1$, $R_0 = 0$, 
$\beta = 0.01$, and $\gamma = 0.1$. 

However, there is no closed-form solution to this integration, and only 
numerical approximation is possible. Figure 7.13 depicts the behavior of 
the SIR model for a set of initial parameters. 
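
Since the SIR model has no closed-form solution, a numerical approximation is the practical route. The sketch below is an illustrative assumption rather than the method used to produce Figure 7.13: it applies forward-Euler steps to Equations 7.32 to 7.34 with the parameter values of Figure 7.13 and, as a sanity check, compares the simulated S with the value predicted from R by Equation 7.38. The function name, step size, and output format are hypothetical.

```python
import math


def simulate_sir(s0, i0, r0, beta, gamma, t_end, dt=0.01):
    """Forward-Euler integration of Equations 7.32-7.34.

    Returns a list of (t, S, I, R) tuples, one per time step.
    The step size dt is an illustrative choice.
    """
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    for k in range(1, int(round(t_end / dt)) + 1):
        new_infections = beta * i * s * dt   # flow from S to I
        recoveries = gamma * i * dt          # flow from I to R
        s, i, r = s - new_infections, i + new_infections - recoveries, r + recoveries
        history.append((k * dt, s, i, r))
    return history


if __name__ == "__main__":
    S0, I0, R0, BETA, GAMMA = 99, 1, 0, 0.01, 0.1  # values of Figure 7.13
    history = simulate_sir(S0, I0, R0, BETA, GAMMA, t_end=50)
    for t, s, i, r in history[::1000]:
        # Equation 7.38: S = S0 * exp(-(beta / gamma) * R).
        predicted_s = S0 * math.exp(-(BETA / GAMMA) * r)
        print(f"t={t:5.1f}  S={s:6.2f}  I={i:6.2f}  R={r:6.2f}  "
              f"S from Eq. 7.38={predicted_s:6.2f}")
```

A smaller `dt`, or a higher-order integrator, would tighten the agreement with Equation 7.38; Euler is used here only because it mirrors the differential equations line by line.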

The two models in the next two subsections are generalized versions 
of the two models discussed thus far: SI and SIR. These models allow 
individuals to have temporary immunity and to get reinfected. 

7.4.4 SIS Model 

The SIS model is the same as the SI model, with the addition of infected 
nodes recovering and becoming susceptible again (see Figure 7.14). The 
differential equations describing the model are 

$$\frac{dS}{dt} = \gamma I - \beta I S, \tag{7.42}$$

$$\frac{dI}{dt} = \beta I S - \gamma I. \tag{7.43}$$

Figure 7.14. SIS Model. 

Figure 7.15. SIS Model Simulated with $S_0 = 99$, $I_0 = 1$, $\beta = 0.01$, 
and $\gamma = 0.1$. 

By replacing S with $N - I$ in Equation 7.43, we arrive at 

$$\frac{dI}{dt} = \beta I (N - I) - \gamma I = I(\beta N - \gamma) - \beta I^2. \tag{7.44}$$

When $\beta N \le \gamma$, the first term will be negative or zero at most; 
hence, the whole term becomes negative. Therefore, in the limit, the value I(t) 
will decrease exponentially to zero. However, when $\beta N > \gamma$, we will 
have a logistic growth function as in the SI model. Having said this, as the 
simulation of the SIS model shows in Figure 7.15, the model will never 
infect everyone. It will reach a steady state, where both susceptibles and 
infecteds reach an equilibrium (see the epidemics exercises). 
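
The two regimes of Equation 7.44 can be illustrated with a short simulation. This is a sketch under assumed settings: the first run uses the parameters of Figure 7.15 (where the contact term outweighs recovery), and the second run uses a hypothetical smaller contact probability chosen only so that recovery dominates. The function names and step size are illustrative.

```python
def sis_rate(n, infected, beta, gamma):
    """Right-hand side of Equation 7.44, i.e., dI/dt."""
    return infected * (beta * n - gamma) - beta * infected ** 2


def simulate_sis(n, i0, beta, gamma, t_end, dt=0.01):
    """Forward-Euler integration of Equation 7.44; returns I at time t_end."""
    infected = float(i0)
    for _ in range(int(round(t_end / dt))):
        infected += dt * sis_rate(n, infected, beta, gamma)
    return infected


if __name__ == "__main__":
    N, I0, GAMMA = 100, 1, 0.1

    # Regime 1: beta * N > gamma (parameters of Figure 7.15).
    final = simulate_sis(N, I0, beta=0.01, gamma=GAMMA, t_end=100)
    print(f"beta*N > gamma:  I(100) = {final:.3f}, "
          f"dI/dt there = {sis_rate(N, final, 0.01, GAMMA):.2e}")

    # Regime 2: beta * N < gamma (beta = 0.0005 is a hypothetical value).
    final = simulate_sis(N, I0, beta=0.0005, gamma=GAMMA, t_end=300)
    print(f"beta*N < gamma:  I(300) = {final:.2e}")
```

In the first run the rate of change printed next to the final value is close to zero, which is what reaching a steady state means; in the second run the infected count decays toward zero.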

7.4.5 SIRS Model 

The final model analyzed in this section is the SIRS model. Just as the 
SIS model extends the SI, the SIRS model extends the SIR, as shown in 
Figure 7.16. In this model, the assumption is that individuals who have 
recovered will lose immunity after a certain period of time and will become 
susceptible again. A new parameter $\lambda$ has been added to the model that 
defines the probability of losing immunity for a recovered individual. The 
set of differential equations that describe this model is 

$$\frac{dS}{dt} = \lambda R - \beta I S, \tag{7.45}$$

$$\frac{dI}{dt} = \beta I S - \gamma I, \tag{7.46}$$

$$\frac{dR}{dt} = \gamma I - \lambda R. \tag{7.47}$$

Figure 7.16. SIRS Model. 

Like the SIR model, this model has no closed-form solution, so numerical 
integration can be used. Figure 7.17 demonstrates a simulation of the 
SIRS model with given parameters of choice. As observed, the simulation 
outcome is similar to the SIR model simulation (see Figure 7.13). The major 
difference is that in the SIRS, the number of susceptible and recovered 
individuals changes non-monotonically over time. For example, in SIRS, the 
number of susceptible individuals decreases over time, but after reaching 
the minimum count, starts increasing again. On the contrary, in the SIR, 
both susceptible individuals and recovered individuals change monotonically, 
with the number of susceptible individuals decreasing over time and 
that of recovered individuals increasing over time. In both SIR and SIRS, 
the infected population changes non-monotonically. 
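
The non-monotonic behavior of the susceptible population can be reproduced with a few lines of code. The following sketch assumes a forward-Euler integration of Equations 7.45 to 7.47 with the parameter values listed for Figure 7.17; the function name, step size, and time horizon are illustrative choices, not taken from the text.

```python
def simulate_sirs(s0, i0, r0, beta, gamma, lam, t_end, dt=0.01):
    """Forward-Euler integration of Equations 7.45-7.47.

    lam is the probability of losing immunity (lambda in the text).
    Returns a list of (t, S, I, R) tuples, one per time step.
    """
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    for k in range(1, int(round(t_end / dt)) + 1):
        infections = beta * i * s * dt   # flow from S to I
        recoveries = gamma * i * dt      # flow from I to R
        immunity_loss = lam * r * dt     # flow from R back to S
        s = s - infections + immunity_loss
        i = i + infections - recoveries
        r = r + recoveries - immunity_loss
        history.append((k * dt, s, i, r))
    return history


if __name__ == "__main__":
    # Parameter values listed for Figure 7.17.
    history = simulate_sirs(99, 1, 0, beta=0.01, gamma=0.1, lam=0.02, t_end=100)
    t_min, s_min, _, _ = min(history, key=lambda row: row[1])
    print(f"S reaches its minimum {s_min:.2f} at t = {t_min:.1f}")
    print(f"S at t = 100 is {history[-1][1]:.2f}")
```

The minimum of S occurs partway through the run and S is larger again at the end, in contrast to the SIR model, where S can only decrease.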


7.4.6 Intervention 

A pressing question in any pandemic or epidemic outbreak is how to stop 
the process. In this section, we discuss epidemic intervention based on a 
recent discovery by Nicholas A. Christakis [2004]. In any epidemic outbreak, 
infected individuals infect susceptible individuals. Although in this 
chapter we discussed random infection, what actually takes place in the real 
world is quite different. Infected individuals have a limited number 
of contacts and can only infect them if said contacts are susceptible. A 
well-connected infected individual contributes far more to the spread of an 
outbreak than someone who has no contacts. In other words, the epidemic 
takes place in a network. Unfortunately, it is often difficult to trace these 
contacts and outline the contact network. If this were possible, the best way 
to intervene in the epidemic outbreak would be to vaccinate the highly 
connected nodes. This would result in what is known as herd immunity and 
would stop the epidemic outbreak. Herd immunity 
entails vaccinating a population inside a herd such that the pathogen cannot 
initiate an outbreak inside the herd. In general, creating herd immunity 
requires at least a random sample of 96% of the population to be vaccinated. 
Interestingly, we can achieve the same herd immunity by making use 
of friends in a network. In general, people know which of their friends have 
more friends. So, they know or have access to these higher degree and more 
connected nodes. Christakis found that if a random population of 30% of 
the herd is selected and then these 30% are asked for their highest degree 
friends, one can achieve herd immunity by vaccinating these friends. Of 
course, older intervention techniques such as separating those infected from 
those susceptible (quarantining them) or removing those infected (killing 
cows with mad cow disease) still work. 

Figure 7.17. SIRS Model Simulated with $S_0 = 99$, $I_0 = 1$, $R_0 = 0$, 
$\gamma = 0.1$, $\beta = 0.01$, and $\lambda = 0.02$. 
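
The friend-nomination step described above can be sketched on a synthetic network. Everything in this example is an assumption made for illustration: the contact network is a randomly grown graph in which new nodes prefer well-connected nodes, the helper names are hypothetical, and the code only shows that nominated friends tend to be better connected than average. It does not test the 30% or 96% figures, which come from the cited study.

```python
import random


def preferential_attachment_graph(n, m, rng):
    """Grow an undirected graph: each new node links to m existing nodes chosen
    with probability proportional to degree (an assumed, synthetic network)."""
    neighbors = {v: set() for v in range(n)}
    for u in range(m + 1):                  # start from a clique of m + 1 nodes
        for v in range(u + 1, m + 1):
            neighbors[u].add(v)
            neighbors[v].add(u)
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw.
    stubs = [u for u in range(m + 1) for _ in neighbors[u]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in targets:
            neighbors[new].add(target)
            neighbors[target].add(new)
            stubs.extend([new, target])
    return neighbors


def nominate_friends(neighbors, fraction, rng):
    """Sample a fraction of nodes at random and return, for each sampled node,
    its highest-degree friend (one nomination per sampled node)."""
    nodes = sorted(neighbors)
    sample = rng.sample(nodes, round(fraction * len(nodes)))
    return [max(neighbors[v], key=lambda u: len(neighbors[u])) for v in sample]


if __name__ == "__main__":
    rng = random.Random(0)
    graph = preferential_attachment_graph(n=200, m=2, rng=rng)
    nominees = nominate_friends(graph, fraction=0.30, rng=rng)
    mean_degree = sum(len(nb) for nb in graph.values()) / len(graph)
    nominee_degree = sum(len(graph[u]) for u in nominees) / len(nominees)
    print(f"mean degree of all nodes:        {mean_degree:.2f}")
    print(f"mean degree of nominated friends: {nominee_degree:.2f}")
    print(f"distinct individuals to vaccinate: {len(set(nominees))}")
```

Vaccinating the distinct nominees targets highly connected nodes without requiring the full contact network, which is the point of the intervention.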

7.5 Summary 

In this chapter, we discussed the concept of information diffusion in social 
networks. In herd behavior, individuals observe the behaviors of others 
and act similarly to them based on their own benefit. We reviewed the 
well-known diners example and urn experiment and demonstrated how 
conditional probabilities can be used to determine why herding takes place. 
We discussed how herding experiments should be designed and ways to 
intervene in herding. 

Next, we discussed the information cascade problem with the constraint 
of sequential decision making. The independent cascade model (ICM) is 
a sender-centric model and has a level of stochasticity associated with it. 
The spread of cascades can be maximized in a network given a budget 
on how many initial nodes can be activated. Unfortunately, the problem 
is NP-hard; therefore, we introduced a greedy approximation algorithm 
that has guaranteed performance due to the submodularity of ICM’s 
activation function. Finally, we discussed how to intervene in information 
cascades. 

Our next topic was the diffusion of innovations. We discussed the 
characteristics of adoption both from the individual and innovation point of 
view. We reviewed well-known theories such as the models introduced by 
Ryan and Gross, Katz, and Rogers, in addition to experiments in the field, 
and different types of adopters. We also detailed mathematical models that 
account for internal, external, and mixed influences and their intervention 
procedures. 

Finally, we moved on to epidemics, an area where decision making is 
usually performed unconsciously. We discussed four epidemic models: SI, 
SIR, SIS, and SIRS; the last two models allow individuals to be reinfected. 
For each model we provided differential equations, numerical solutions, 
and closed-form solutions, when available. We concluded the chapter 
with intervention approaches to epidemic outbreaks and a review of herd 
immunity in epidemics. Although a 96% random vaccination is required 
for achieving herd immunity, it is also possible to achieve it by selecting 
a random population of 30% and then vaccinating their highest degree 
friends. 


7.6 Bibliographic Notes 

The concept of the herd has been well studied in psychology by Freud 
(crowd psychology), Carl Gustav Jung (the collective unconscious), and 
Gustave Le Bon (the popular mind). It has also been observed in economics 
by Veblen [1899] and in studies related to the bandwagon effect [Rohlfs 
and Varian, 2003; Simon, 1954; Leibenstein, 1950]. The behavior is also 
discussed in terms of sociability [Simmel and Hughes, 1949] in sociology. 

Herding, first coined by Banerjee [1992], at times refers to a slightly 
different concept. In the herd behavior discussed in this chapter, the crowd 
does not necessarily start with the same decision, but will eventually reach 
one, whereas in herding the same behavior is usually observed. Moreover, 
in herd behavior, individuals decide whether the action they are taking has 
some benefits to themselves or is rational, and based on that, they will align 
with the population. In herding, some level of uncertainty is associated with 
the decision, and the individual does not know why he or she is following 
the crowd. 

Another confusion is that the terms “herd behavior/herding” are often used 
interchangeably with “information cascades” [Bikhchandani et al., 1992; 
Welch, 1992]. To avoid this problem, we clearly define both in the chapter 
and assume that in herd behavior decisions are taken based on global 
information, whereas in information cascades, local information is utilized. 

Herd behavior has been studied in the context of financial markets [Cont 
and Bouchaud, 2000; Drehmann et al., 2005; Bikhchandani and Sharma, 
2001; Devenow and Welch, 1996] and investment [Scharfstein and Stein, 
1990]. Gale analyzes the robustness of different herd models in terms of 
different constraints and externalities [Gale, 1996], and Shiller discusses 
the relation between information, conversation, and herd behavior [Shiller, 
1995]. Another well-known social conformity experiment was conducted 
in Manhattan by Milgram et al. [1969]. 

Other recent applications of threshold models can be found in [Young, 
1988; Watts, 1999, 2002; Valente, 1995, 1996a; Schelling, 1978; Peleg, 
1997; Morris, 2000; Macy and Willer, 2002; Macy, 1991; Granovetter, 
1976; Berger, 2001]. Bikhchandani et al. [1998] review conformity, fads, 
and information cascades and describe how observing past human decisions 
can help explain human behavior. Hirshleifer [1997] provides information 
cascade examples in many fields, including zoology and finance. 

In terms of diffusion models, Robertson [1967] describes the process and 
Hagerstrand [1967] introduces a model based on the spatial stages of the 
diffusion of innovations and Monte Carlo simulation models for diffusion of 
innovations. Bass [1969] discusses a model based on differential equations. 
Mahajan and Peterson [1978] extend the Bass model. 

Instances of external-influence models can be found in [Hamblin et al., 
1973; Coleman et al., 1966] and internal-influence models are applied in 
[Mansfield, 1961; Griliches, 2007; Gray, 2007]. The Gompertz function 
[Martino, 1983], widely used in forecasting, has a direct relationship with 
the internal-influence diffusion curve. Mixed-influence model examples 
include the work of Mahajan and Muller [1982] and the Bass model [Bass, 
1969]. 

Midgley and Dowling [1978] introduce the contingency model. Abrahamson 
et al. mathematically analyze the bandwagon effect and diffusion 
of innovations. Their model predicts whether the bandwagon effect will 
occur and how many organizations will adopt the innovation. Network 
models of diffusion and thresholds for diffusion of innovations models are 
discussed by Valente [1996a, 1996b]. Diffusion through blogspace and, in 
general, social networks has been analyzed by [Gruhl et al., 2004; Leskovec 
et al., 2007; Yang and Leskovec, 2010; Zafarani, Cole, and Liu, 2010]. 

For information on different pandemics, refer to [Nohl, 2006; Bell, 1995; 
Patterson and Runge, 2002; Des Jarlais et al., 1994; Dye and Gay, 2003; 
Chinese et al., 2004; Guan et al., 2007; Nelson and Holmes, 2007]. To 
review some early and in-depth analysis of epidemic models, refer to 
[Bailey et al., 1975; Anderson and May, 1991]. Surveys of epidemics can 
be found in [Hethcote, 1994, 2000; Hethcote et al., 1981; Dietz, 1967]. 
Epidemics in networks have been discussed extensively [Newman, 2010; 
Moore and Newman, 1999; Keeling and Eames, 2005]. Other general 
sources include [Lewis, 2009; Easley and Kleinberg, 2010; Newman, 2010; 
Barrat et al., 2008]. A generalized model for contagion is provided by 
Dodds and Watts [2004] and, in the case of best response dynamics, in 
Morris [2000]. 

Other topics related to this chapter include wisdom of crowd models 
[Golub and Jackson, 2010] and swarm intelligence [Eberhart et al., 2001; 
Engelbrecht, 2005; Bonabeau et al., 1999; Kennedy, 2006]. One can also 
analyze information provenance, which aims to identify the sources from 
which information has diffused. Barbier et al. [2013] provide an overview 
of information provenance in social media in their book. 

7.7 Exercises 

1. Discuss how different information diffusion modeling techniques differ. 
Name applications on social media that can make use of methods 
in each area. 


Herd Effect 

2. What are the minimum requirements for a herd behavior experiment? 
Design an experiment of your own. 


Diffusion of Innovation 

3. Simulate internal-, external-, and mixed-influence models in a program. 
How are the saturation levels different for each model? 

4. Provide a simple example of diffusion of innovations and suggest a 
specific way of intervention to expedite the diffusion. 


Information Cascades 

5. Briefly describe the independent cascade model (ICM). 

6. What is the objective of cascade maximization? What are the usual 
constraints? 

7. Follow the ICM procedure until it converges for the following graph. 
Assume that node i activates node j when $i - j \equiv 1 \pmod{3}$ and 
node 5 is activated at time 0. 


Epidemics 



8. Discuss the mathematical relationship between the SIR and the SIS 
models. 

9. Based on our assumptions in the SIR model, the probability that an 
individual remains infected follows a standard exponential distribution. 
Describe why this happens. 

10. In the SIR model, what is the most likely time to recover based on 
the value of $\gamma$? 

11. In the SIRS model, compute the length of time that an infected 
individual is likely to remain infected before he or she recovers. 

12. After the model saturates, how many are infected in the SIS model? 


Part III 

Applications 

Influence and Homophily 


Social forces connect individuals in different ways. When individuals get 
connected, one can observe distinguishable patterns in their connectivity 
networks. One such pattern is assortativity, also known as social similarity. 
In networks with assortativity, similar nodes are connected to one another 
more often than dissimilar nodes. For instance, in social networks, a high 
similarity between friends is observed. This similarity is exhibited by 
similar behavior, similar interests, similar activities, and shared attributes 
such as language, among others. In other words, friendship networks are 
assortative. Investigating assortativity patterns that individuals exhibit on 
social media helps one better understand user interactions. Assortativity is 
the most commonly observed pattern among linked individuals. This chapter 
discusses assortativity along with principal factors that result in assortative 
networks. 

Many social forces induce assortative networks. Three common forces 
are influence, homophily, and confounding. Influence is the process by 
which an individual (the influential) affects another individual such that 
the influenced individual becomes more similar to the influential figure. 
Homophily is observed in already similar individuals. It is realized when 
similar individuals become friends due to their high similarity. Confounding 
is the environment’s effect on making individuals similar. For instance, 
individuals who live in Russia speak Russian fluently because of the 
environment and are therefore similar in language. The confounding force is 
an external factor that is independent of inter-individual interactions and is 
therefore not discussed further. 

Note that both influence and homophily social forces give rise to 
assortative networks. After either of them affects a network, the network 
exhibits more similar nodes; however, when “friends become similar,” we denote 
that as influence, and when “similar individuals become friends,” we call 
it homophily. Figure 8.1 depicts how both influence and homophily affect 
social networks. 

Figure 8.1. Influence and Homophily. 


In particular, when discussing influence and homophily in social media, 
we are interested in asking the following questions: 

• How can we measure influence or homophily? 

• How can we model influence or homophily? 

• How can we distinguish between the two? 

Because both processes result in assortative networks, we can quantify 
their effect on the network by measuring the assortativity of the network. 


8.1 Measuring Assortativity 

Measuring assortativity helps quantify how much influence and homophily, 
among other factors, have affected a social network. Assortativity can be 
quantified by measuring how similar connected nodes are to one another. 
Figure 8.2 depicts the friendship network in a U.S. high school in 1994. 
In the figure, races are represented with different colors: whites are white, 
blacks are gray, Hispanics are light gray, and others are black. As we observe, 
there is a high assortativity between individuals of the same race, 
particularly among whites and among blacks. Hispanics have a high tendency to 
become friends with whites. 

To measure assortativity, we measure the number of edges that fall in 
between the nodes of the same race. This technique works for nominal 
attributes, such as race, but does not work for ordinal ones such as age. 
Consider a network where individuals are friends with people of different 
ages. Unlike with race, individuals are more likely to be friends with others 
close in age, but not necessarily with ones of the exact same age. Hence, 
we discuss two techniques: one for nominal attributes and one for ordinal 
attributes. 

Figure 8.2. A U.S. High School Friendship Network in 1994 between Races. Eighty 
percent of the links exist between members of the same race (from [Currarini et al., 
2009]). 


8.1.1 Measuring Assortativity for Nominal Attributes 

Consider a scenario where we have nominal attributes assigned to nodes. 
As in our example, this attribute could be race or nationality, gender, or 
the like. One simple technique to measure assortativity is to consider the 
number of edges that are between nodes of the same type. Let $t(v_i)$ denote 
the type of node $v_i$. In an undirected graph G(V, E), with adjacency 
matrix A, this measure can be computed as follows, 

$$\frac{1}{m} \sum_{(v_i, v_j) \in E} \delta(t(v_i), t(v_j)) = \frac{1}{2m} \sum_{ij} A_{ij}\, \delta(t(v_i), t(v_j)), \tag{8.1}$$

where m is the number of edges in the graph, $\frac{1}{m}$ is applied for 
normalization, and the factor $\frac{1}{2}$ is added because G is undirected. 
$\delta(\cdot, \cdot)$ is the Kronecker delta function: 

$$\delta(x, y) = \begin{cases} 0, & \text{if } x \neq y; \\ 1, & \text{if } x = y. \end{cases} \tag{8.2}$$

This measure has its limitations. Consider a school in which all students are 
Hispanic. Obviously, all connections will be between Hispanics, and an 
assortativity value of 1 is not a significant finding. However, consider a 
school where half the population is white and half the population is Hispanic. 
It is statistically expected that 50% of the connections will be between 
members of different races. If connections in this school were only between 
whites and Hispanics, then our finding is significant. To account for this 
issue, we can employ a common technique where we measure the assortativity 
significance by subtracting the statistically expected assortativity from the 
measured assortativity. The higher this value, the more significant the 
assortativity observed. 
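
The edge-counting measure of Equation 8.1 and the limitation just described can be sketched in a few lines. The graphs, node labels, and function names below are hypothetical; the example only shows that the edge-list form and the adjacency-matrix form of Equation 8.1 agree and that a single-type population trivially scores 1.

```python
def same_type_edge_fraction(edges, node_type):
    """Left-hand side of Equation 8.1: fraction of edges joining same-type nodes."""
    same = sum(1 for u, v in edges if node_type[u] == node_type[v])
    return same / len(edges)


def same_type_fraction_from_adjacency(adjacency, types):
    """Right-hand side of Equation 8.1: (1/2m) * sum_ij A_ij * delta(t_i, t_j)."""
    n = len(adjacency)
    two_m = sum(sum(row) for row in adjacency)  # every edge is counted twice
    same = sum(adjacency[i][j]
               for i in range(n) for j in range(n) if types[i] == types[j])
    return same / two_m


if __name__ == "__main__":
    # A hypothetical four-node graph with two types, "x" and "y".
    node_type = {0: "x", 1: "x", 2: "y", 3: "y"}
    edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
    adjacency = [[0] * 4 for _ in range(4)]
    for u, v in edges:
        adjacency[u][v] = adjacency[v][u] = 1
    types = [node_type[v] for v in range(4)]

    print(same_type_edge_fraction(edges, node_type))            # edge-list form
    print(same_type_fraction_from_adjacency(adjacency, types))  # matrix form

    # The limitation: with a single type, every edge is a same-type edge.
    one_type = {v: "x" for v in range(4)}
    print(same_type_edge_fraction(edges, one_type))
```

The last line prints 1.0 regardless of how the edges are arranged, which is why the raw fraction needs to be compared against what is statistically expected.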

Consider a graph G(V, E), |E| = m, where the degrees are known 
beforehand (how many friends an individual has), but the edges are not. 
Consider two nodes $v_i$ and $v_j$, with degrees $d_i$ and $d_j$, respectively. 
What is the expected number of edges between these two nodes? Consider node 
$v_i$. For any edge going out of $v_i$ randomly, the probability of this edge 
getting connected to node $v_j$ is $\frac{d_j}{\sum_i d_i} = \frac{d_j}{2m}$. 
Since the degree for $v_i$ is $d_i$, we have $d_i$ such edges; hence, the 
expected number of edges between $v_i$ and $v_j$ is $\frac{d_i d_j}{2m}$. Now, 
the expected number of edges between $v_i$ and $v_j$ that are of the same type 
is $\frac{d_i d_j}{2m}\, \delta(t(v_i), t(v_j))$ and the expected number of 
edges of the same type in the whole graph is 

$$\frac{1}{m} \sum_{(v_i, v_j) \in E} \frac{d_i d_j}{2m}\, \delta(t(v_i), t(v_j)) = \frac{1}{2m} \sum_{ij} \frac{d_i d_j}{2m}\, \delta(t(v_i), t(v_j)). \tag{8.3}$$
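
As a quick sanity check on the expected number of edges (a worked step added here for illustration, not part of the original derivation), summing $\frac{d_i d_j}{2m}$ over all nodes $v_j$ recovers the degree of $v_i$, and summing over all ordered pairs recovers twice the number of edges. Note that the sums include the term $j = i$, so this random model implicitly allows self-loops.

```latex
\sum_{j} \frac{d_i d_j}{2m} = \frac{d_i}{2m} \sum_{j} d_j = \frac{d_i}{2m} \cdot 2m = d_i,
\qquad
\sum_{ij} \frac{d_i d_j}{2m} = \sum_{i} d_i = 2m .
```

In other words, the random model preserves every node's degree in expectation, which is what makes it a fair baseline for the observed assortativity.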


We are interested in computing the distance between the assortativity 
observed and the expected assortativity: 

$$Q = \frac{1}{2m} \sum_{ij} A_{ij}\, \delta(t(v_i), t(v_j)) - \frac{1}{2m} \sum_{ij} \frac{d_i d_j}{2m}\, \delta(t(v_i), t(v_j)) \tag{8.4}$$

$$= \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta(t(v_i), t(v_j)). \tag{8.5}$$

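This measure is called modularity [Newman, 2006]. The maximum modularity 
value for a network depends on the number of nodes of the same type 
and degree. The maximum occurs when all edges are connecting nodes of 
the same type (i.e., when $A_{ij} = 1$, $\delta(t(v_i), t(v_j)) = 1$). We can 
normalize modularity by dividing it by the maximum it can take: 

$$Q_{\text{normalized}} = \frac{Q}{Q_{\max}} \tag{8.6}$$

$$= \frac{\frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta(t(v_i), t(v_j))}{\max \left[ \frac{1}{2m} \sum_{ij} A_{ij}\, \delta(t(v_i), t(v_j)) - \frac{1}{2m} \sum_{ij} \frac{d_i d_j}{2m}\, \delta(t(v_i), t(v_j)) \right]} \tag{8.7}$$

$$= \frac{\frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta(t(v_i), t(v_j))}{\frac{1}{2m}\, 2m - \frac{1}{2m} \sum_{ij} \frac{d_i d_j}{2m}\, \delta(t(v_i), t(v_j))} \tag{8.8}$$

$$= \frac{\sum_{ij} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta(t(v_i), t(v_j))}{2m - \sum_{ij} \frac{d_i d_j}{2m}\, \delta(t(v_i), t(v_j))}. \tag{8.9}$$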

Modularity can be simplified using a matrix format. Let 
$\Delta \in \mathbb{R}^{n \times k}$ denote the indicator matrix and let k 
denote the number of types, 

$$\Delta_{x,k} = \begin{cases} 1, & \text{if } t(x) = k; \\ 0, & \text{if } t(x) \neq k. \end{cases} \tag{8.10}$$

Note that the $\delta$ function can be reformulated using the indicator matrix: 

$$\delta(t(v_i), t(v_j)) = \sum_{k} \Delta_{v_i,k}\, \Delta_{v_j,k}. \tag{8.11}$$

Therefore, $(\Delta \Delta^T)_{i,j} = \delta(t(v_i), t(v_j))$. Let 
$B = A - dd^T/2m$ denote the modularity matrix where 
$d \in \mathbb{R}^{n \times 1}$ is the degree vector for all nodes. Given 
that the trace of multiplication of two matrices X and $Y^T$ is 
$\mathrm{Tr}(XY^T) = \sum_{i,j} X_{i,j} Y_{i,j}$ and 
$\mathrm{Tr}(XY) = \mathrm{Tr}(YX)$, modularity can be reformulated as 

$$Q = \frac{1}{2m} \sum_{ij} \underbrace{\left( A_{ij} - \frac{d_i d_j}{2m} \right)}_{B_{ij}} \underbrace{\delta(t(v_i), t(v_j))}_{(\Delta \Delta^T)_{i,j}} = \frac{1}{2m} \mathrm{Tr}(B \Delta \Delta^T) = \frac{1}{2m} \mathrm{Tr}(\Delta^T B \Delta). \tag{8.12}$$

Figure 8.3. A Modularity Example for a Bipartite Graph. 


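Example 8.1. Consider the bipartite graph in Figure 8.3. For this bipartite
graph, with adjacency matrix $A$, type indicator matrix $\Delta$, and degree vector $d$,

$$A = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix}, \quad \Delta = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, \quad d = \begin{bmatrix} 2 \\ 2 \\ 2 \\ 2 \end{bmatrix}, \quad m = 4. \tag{8.13}$$

Therefore, matrix $B$ is

$$B = A - dd^T/2m = \begin{bmatrix} -0.5 & -0.5 & 0.5 & 0.5 \\ -0.5 & -0.5 & 0.5 & 0.5 \\ 0.5 & 0.5 & -0.5 & -0.5 \\ 0.5 & 0.5 & -0.5 & -0.5 \end{bmatrix}. \tag{8.14}$$

The modularity value $Q$ is

$$\frac{1}{2m} \mathrm{Tr}(\Delta^T B \Delta) = -0.5. \tag{8.15}$$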


In this example, all edges are between nodes of different color. In other 
words, the number of edges between nodes of the same color is less than the 
expected number of edges between them. Therefore, the modularity value 
is negative. 
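
As a quick numerical check of Example 8.1, the following Python sketch evaluates Equation 8.5 directly by summing over all node pairs of the same type. The function name `modularity`, the list-of-lists adjacency matrix, and the color labels are illustrative assumptions and do not come from the original text.

```python
def modularity(adj, types):
    """Q = (1/2m) * sum_ij (A_ij - d_i * d_j / 2m) * delta(t(v_i), t(v_j))."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # every undirected edge contributes to two degrees
    q = 0.0
    for i in range(n):
        for j in range(n):
            if types[i] == types[j]:  # delta(t(v_i), t(v_j)) = 1
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


# Bipartite graph of Example 8.1; the color names are hypothetical labels.
A = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [1, 1, 0, 0],
     [1, 1, 0, 0]]
types = ["white", "white", "gray", "gray"]
print(modularity(A, types))  # -0.5
```

The double loop mirrors the sum over $ij$ in the equation; for large graphs the trace form of Equation 8.12 with a matrix library would likely be preferable.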


8.1.2 Measuring Assortativity for Ordinal Attributes 

A common measure for analyzing the relationship between two variables
with ordinal values is covariance. Covariance describes how two variables
change with respect to each other. In our case, we are interested in how
correlated the attribute values of nodes connected via edges are. Let $x_i$ be
the ordinal attribute value associated with node $v_i$. In Figure 8.4, for node
$c$, the value associated is $x_c = 21$.
[Figure: three nodes a (18), c (21), and b (20), with edges a-c and c-b.]

Figure 8.4. A Correlation Example.

We construct two variables $X_L$ and $X_R$, where for any edge $(v_i, v_j)$
we assume that $x_i$ is observed from variable $X_L$ and $x_j$ is observed from
variable $X_R$. For Figure 8.4,

$$X_L = \begin{bmatrix} 18 \\ 21 \\ 21 \\ 20 \end{bmatrix}, \quad X_R = \begin{bmatrix} 21 \\ 18 \\ 20 \\ 21 \end{bmatrix}.$$


In other words, $X_L$ represents the ordinal values associated with the left
node of the edges, and $X_R$ represents the values associated with the right
node of the edges. Our problem is therefore reduced to computing the
covariance between variables $X_L$ and $X_R$. Note that since we are considering
an undirected graph, both edges $(v_i, v_j)$ and $(v_j, v_i)$ exist; therefore, $x_i$ and
$x_j$ are observed in both $X_L$ and $X_R$. Thus, $X_L$ and $X_R$ include the same
set of values but in a different order. This implies that $X_L$ and $X_R$ have the
same mean and standard deviation:

$$E(X_L) = E(X_R), \tag{8.17}$$

$$\sigma(X_L) = \sigma(X_R). \tag{8.18}$$

Since we have $m$ edges and each edge appears twice for the undirected
graph, then $X_L$ and $X_R$ have $2m$ elements. Each value $x_i$ appears $d_i$ times
since it appears as endpoints of $d_i$ edges. The covariance between $X_L$ and
$X_R$ is

$$\begin{aligned}
\sigma(X_L, X_R) &= E[(X_L - E[X_L])(X_R - E[X_R])] \\
&= E[X_L X_R - X_L E[X_R] - E[X_L] X_R + E[X_L] E[X_R]] \\
&= E[X_L X_R] - E[X_L] E[X_R] - E[X_L] E[X_R] + E[X_L] E[X_R] \\
&= E[X_L X_R] - E[X_L] E[X_R].
\end{aligned} \tag{8.19}$$

Here, $E(X_L)$ is the mean (expected value) of variable $X_L$, and $E(X_L X_R)$ is
the mean of the multiplication of $X_L$ and $X_R$. In our setting and following
Equation 8.17, these expectations are as follows:

$$E(X_L) = E(X_R) = \frac{\sum_i (X_L)_i}{2m} = \frac{\sum_i d_i x_i}{2m}, \tag{8.20}$$

$$E(X_L X_R) = \frac{1}{2m} \sum_i (X_L)_i (X_R)_i = \frac{\sum_{ij} A_{ij} x_i x_j}{2m}. \tag{8.21}$$

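By plugging Equations 8.20 and 8.21 into Equation 8.19, the covariance
between $X_L$ and $X_R$ is

$$\begin{aligned}
\sigma(X_L, X_R) &= E[X_L X_R] - E[X_L] E[X_R] \\
&= \frac{\sum_{ij} A_{ij} x_i x_j}{2m} - \frac{\sum_{ij} d_i d_j x_i x_j}{(2m)^2} \\
&= \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{2m} \right) x_i x_j.
\end{aligned} \tag{8.22}$$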


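Similar to modularity (Section 8.1.1), we can normalize covariance.
Pearson correlation $\rho(X_L, X_R)$ is the normalized version of covariance:

$$\rho(X_L, X_R) = \frac{\sigma(X_L, X_R)}{\sigma(X_L)\, \sigma(X_R)}.$$

From Equation 8.18, $\sigma(X_L) = \sigma(X_R)$; thus,

$$\rho(X_L, X_R) = \frac{\sigma(X_L, X_R)}{\sigma(X_L)^2} = \frac{\frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{2m} \right) x_i x_j}{E[(X_L)^2] - (E[X_L])^2} = \frac{\frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{2m} \right) x_i x_j}{\frac{1}{2m} \sum_{ij} A_{ij} x_i^2 - \frac{1}{2m} \sum_{ij} \frac{d_i d_j}{2m} x_i x_j}. \tag{8.24}$$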


Note the similarity between Equations 8.9 and 8.24. Although modularity
is used for nominal attributes and correlation for ordinal attributes, the major
difference between the two equations is that the $\delta$ function in modularity is
replaced by $x_i x_j$ in the correlation equation.


Example 8.2. Consider Figure 8.4 with values demonstrating the attributes 
associated with each node. Since this graph is undirected, we have the 
following edges: 


E = {(a, c), (c, a), (c, b ), (b, c)}. 


(8.25) 
The correlation is between the values associated with the endpoints of
the edges. Consider $X_L$ as the value of the left end of an edge and $X_R$ as
the value of the right end of an edge:

$$X_L = \begin{bmatrix} 18 \\ 21 \\ 21 \\ 20 \end{bmatrix}, \quad X_R = \begin{bmatrix} 21 \\ 18 \\ 20 \\ 21 \end{bmatrix}. \tag{8.26}$$

The correlation between these two variables is $\rho(X_L, X_R) = -0.67$.
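
The value in Example 8.2 can be reproduced with a short Python sketch that builds $X_L$ and $X_R$ from the edge list and divides the covariance by the shared variance. The function name `edge_correlation` and the dictionary-based attribute storage are assumptions made for illustration.

```python
def edge_correlation(edges, x):
    """Pearson correlation between attribute values at the two ends of edges.

    `edges` lists each undirected edge once; both directions are generated
    here so that X_L and X_R hold the same values in a different order.
    """
    left, right = [], []
    for u, v in edges:
        left += [x[u], x[v]]
        right += [x[v], x[u]]
    k = len(left)  # equals 2m
    mean = sum(left) / k  # X_L and X_R share the same mean (Equation 8.17)
    cov = sum((a - mean) * (b - mean) for a, b in zip(left, right)) / k
    var = sum((a - mean) ** 2 for a in left) / k  # sigma(X_L)^2 = sigma(X_R)^2
    return cov / var  # undefined (division by zero) if all values are equal


x = {"a": 18, "b": 20, "c": 21}
rho = edge_correlation([("a", "c"), ("c", "b")], x)
print(round(rho, 2))  # -0.67
```

Here the covariance is -1 and the variance is 1.5, which gives the negative correlation reported in the example.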

8.2 Influence 

Influence is “the act or power of producing an effect without apparent
exertion of force or direct exercise of command.” In this section, we discuss 
influence and, in particular, how we can (1) measure influence in social 
media and (2) design models that concretely detail how individuals influence 
one another in social media. 


8.2.1 Measuring Influence 


Influence can be measured based on (1) prediction or (2) observation. 

Prediction-Based Measures. In prediction-based measurement, we
assume that an individual’s attribute or the way she is situated in the network
predicts how influential she will be. For instance, we can assume that
the gregariousness (e.g., number of friends) of an individual is correlated
with how influential she will be. Therefore, it is natural to use any of the
centrality measures discussed in Chapter 3 for prediction-based influence
measurements. Examples of such centrality measures include PageRank
and degree centrality. In fact, many of these centrality measures were
introduced as influence-measuring techniques. For instance, on Twitter,
in-degree (number of followers) is a common attribute for measuring
influence. Since these methods were covered in depth in that chapter, in
this section we focus on observational techniques.
Observation-Based Measures. In observation-based measures, we quantify
the influence of an individual by measuring the amount of influence
attributed to him or her. An individual can exert influence differently in
different settings, and so, depending on the context, the observation-based
measurement of influence changes. We next describe three different settings
and how influence can be measured in each.

1. When an individual is the role model. This happens in the case of 
individuals in the fashion industry, teachers, and celebrities. In this 
case, the size of the audience that has been influenced due to that 
fashion, charisma, or the like could act as an accurate measure. A 
local grade-school teacher has a tremendous influence over a class 
of students, whereas Gandhi influenced millions. 

2. When an individual spreads information. This scenario is more
likely when a piece of information, an epidemic, or a product is
being spread in a network. In this case, the size of the cascade
(that is, the number of hops the information traveled), the population
affected, or the rate at which the population gets influenced is
considered a measure.

3. When an individual increases value. As in the case of diffusion of 
innovations (see Chapter 7), often when individuals perform actions 
such as buying a product, they increase the value of the product for 
other individuals. For example, the first individual who bought a fax 
machine had no one to send faxes to. The second individual who 
bought a fax machine increased its value for the first individual. So, 
the increase (or rate of increase) in the value of an item or action 
(such as buying a product) is often used as a measure. 


Case Studies for Measuring Influence in Social Media 

This section provides examples of measuring influence in the blogosphere
and on the microblogging site Twitter. These techniques can be adapted to
other social media sites as well.


Measuring Social Influence in the Blogosphere 

The goal of measuring influence in the blogosphere is to identify influential
bloggers. Due to the limited time that individuals have, following the
influentials is often necessary for fast access to interesting news. One common
measure for quantifying the influence of bloggers is to use in-degree
centrality: the number of (in-)links that point to the blog. However, because
of the sparsity of in-links, more detailed analysis is required to measure
influence in the blogosphere.
In their book, The Influentials: One American in Ten Tells the Other
Nine How to Vote, Where to Eat, and What to Buy, Keller and Berry [2003]
argue that influentials are individuals who (1) are recognized by others, (2)
engage in activities that result in follow-up activities, (3) have novel
perspectives, and (4) are eloquent.

To address these issues, Agarwal et al. [2008] proposed the iFinder
system to measure influence of blogposts and to identify influential bloggers.
In particular, for each one of these four characteristics and a blogpost $p$, they
approximate the characteristic by collecting specific attributes of the blogpost:

1. Recognition. Recognition for a blogpost can be approximated by the
links that point to the blogpost (in-links). Let $I_p$ denote the set of
in-links that point to blogpost $p$.

2. Activity Generation. Activity generated by a blogpost can be estimated
using the number of comments that $p$ receives. Let $c_p$ denote
the number of comments that blogpost $p$ receives.

3. Novelty. The blogpost’s novelty is inversely correlated with the number
of references a blogpost employs. In particular, the more citations
a blogpost has, the less novel it is. Let $O_p$ denote the set of out-links
for blogpost $p$.

4. Eloquence. Eloquence can be estimated by the length of the blogpost.
Given the informal nature of blogs and the bloggers’ tendency to write
short blogposts, longer blogposts are commonly believed to be more
eloquent. So, the length of a blogpost $l_p$ can be employed as a measure
of eloquence.

Given these approximations for each one of these characteristics, we
can design a measure of influence for each blogpost. Since the number of
out-links inversely affects the influence of a blogpost and the number of
in-links increases it, we construct an influence graph, or i-graph, where
blogposts are nodes and influence flows through the nodes. The amount of
this influence flow for each post $p$ can be characterized as

$$\mathrm{InfluenceFlow}(p) = w_{in} \sum_{m=1}^{|I_p|} I(P_m) - w_{out} \sum_{n=1}^{|O_p|} I(P_n), \tag{8.27}$$

where $I(\cdot)$ denotes the influence of a blogpost and $w_{in}$ and $w_{out}$ are the
weights that adjust the contribution of in- and out-links, respectively. In this
equation, $P_m$’s are blogposts that point to post $p$, and $P_n$’s are blogposts
that are referred to in post $p$. Influence flow describes a measure that only
accounts for in-links (recognition) and out-links (novelty). To account for
the other two factors, we design the influence of a blogpost $p$ as

$$I(p) = w_{length}\, l_p\, (w_{comment}\, c_p + \mathrm{InfluenceFlow}(p)). \tag{8.28}$$

Here, $w_{length}$ is the weight for the length of the blogpost, and $w_{comment}$ describes
how the number of comments is weighted. Note that the four weights
$w_{in}$, $w_{out}$, $w_{comment}$, and $w_{length}$ need to be tuned to make the model more
accurate. This tuning can be done by a variety of techniques. For instance, we
can use a test system where the influential posts are already known (labeled
data) to tune them. Finally, a blogger’s influence index (iIndex) can be
defined as the maximum influence value among all his or her $N$ blogposts,

$$iIndex = \max_{p_n \in N} I(p_n). \tag{8.29}$$

Computing iIndex for a set of bloggers over all their blogposts can help
identify and rank influential bloggers in a system. 
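
To make Equations 8.27 through 8.29 concrete, the sketch below evaluates them for one post whose linked posts already have known influence values. All function names, the unit default weights, and the sample numbers are assumptions for illustration; in the actual system the weights are tuned, and because influence values depend on one another through links they would have to be computed jointly rather than in a single pass.

```python
def influence_flow(in_influences, out_influences, w_in=1.0, w_out=1.0):
    """Equation 8.27: weighted influence of in-linking posts minus out-linked posts."""
    return w_in * sum(in_influences) - w_out * sum(out_influences)


def post_influence(length, comments, flow, w_length=1.0, w_comment=1.0):
    """Equation 8.28: the length-weighted sum of comment activity and influence flow."""
    return w_length * length * (w_comment * comments + flow)


def i_index(post_influences):
    """Equation 8.29: a blogger's iIndex is the influence of his or her best post."""
    return max(post_influences)


# Hypothetical post: two in-linking posts, one out-linked post, 4 comments.
flow = influence_flow(in_influences=[2.0, 1.0], out_influences=[0.5])
influence = post_influence(length=3, comments=4, flow=flow)
print(flow, influence)               # 2.5 19.5
print(i_index([influence, 7.0]))     # 19.5
```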

Measuring Social Influence on Twitter. On Twitter, a microblogging
platform, users receive tweets from other users by following them. Intuitively,
we can think of the number of followers as a measure of influence
(in-degree centrality). In particular, three measures are frequently used to
quantify influence in Twitter:

1. In-degree: the number of users following a person on Twitter. As 
discussed, the number of individuals who are interested in someone’s 
tweets (i.e., followers) is commonly used as an influence measure on 
Twitter. In-degree denotes the “audience size” of an individual. 

2. Number of mentions: the number of times an individual is mentioned
in tweets. Mentioning an individual with a username handle
is performed by including @username in a tweet. The number of
times an individual is mentioned can be used as an influence measure.
The number of mentions denotes the “ability in engaging others
in conversation” [Cha et al., 2010].

3. Number of retweets: the number of times tweets of a user are 
retweeted. Individuals on Twitter have the opportunity to forward 
tweets to a broader audience via the retweet capability. Clearly, the 
more one’s tweets are retweeted, the more likely one is influential. 
The number of retweets indicates an individual’s ability to generate 
content that is worth being passed along. 

Table 8.1. Rank Correlation between Top 10% of Influentials for Different
Measures on Twitter

Measures                    Correlation Value
In-degree vs. retweets      0.122
In-degree vs. mentions      0.286
Retweets vs. mentions       0.638

Each one of these measures by itself can be used to identify influential
users in Twitter. This can be done by utilizing the measure for each
individual and then ranking individuals based on their measured influence
value. Contrary to public perception, the number of followers is considered
an inaccurate measure compared to the other two. This is shown in [Cha
et al., 2010], where the authors ranked individuals on Twitter independently
based on these three measures. To see if they are correlated or redundant,
they compared ranks of individuals across three measures using rank correlation
measures. One such measure is the Spearman’s rank correlation
coefficient,


$$\rho = 1 - \frac{6 \sum_{i=1}^{n} (m_1^i - m_2^i)^2}{n^3 - n}, \tag{8.30}$$


where $m_1^i$ and $m_2^i$ are ranks of individual $i$ based on measures $m_1$ and $m_2$,
and $n$ is the total number of users. Spearman’s rank correlation is the Pearson
correlation coefficient for ordinal variables that represent ranks (i.e., takes
values between $1 \ldots n$); hence, the value is in range $[-1, 1]$. Their findings
suggest that popular users (users with high in-degree) do not necessarily
have high ranks in terms of number of retweets or mentions. This can be
observed in Table 8.1, which shows the Spearman’s correlation between the
top 10% influentials for each measure.
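
Equation 8.30 is simple enough to evaluate by hand, but a small Python sketch makes the computation explicit. The function name `spearman` and the sample ranks are illustrative assumptions; the closed form used here also assumes that neither ranking contains ties.

```python
def spearman(ranks_1, ranks_2):
    """Spearman's rank correlation (Equation 8.30) for two tie-free rankings."""
    n = len(ranks_1)
    squared_gaps = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * squared_gaps / (n ** 3 - n)


# Hypothetical ranks of five users under two influence measures.
by_in_degree = [1, 2, 3, 4, 5]
by_retweets = [2, 1, 4, 3, 5]
print(spearman(by_in_degree, by_retweets))  # 0.8
```

Identical rankings give 1 and exactly reversed rankings give -1, which matches the stated range of the coefficient.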


8.2.2 Modeling Influence 

In influence modeling, our goal is to design models that can explain how 
individuals influence one another. Given the nature of social media, it is 
safe to assume that influence takes place among connected individuals. At 
times, this network is observable (explicit networks), and at other times, it is
unobservable (implicit networks). For instance, in referral networks, where 
people refer others to join an online service on social media, the network 
of referrals is often observable. In contrast, people are influenced to buy 
products, and in most cases, the seller has no information on who referred 
the buyer, but does have approximate estimates on the number of products 
sold over time. In the observable (explicit) network, we resort to threshold 
models such as the linear threshold model (LTM) to model influence; in
implicit networks, we can employ methods such as the linear influence 
model (LIM) that take the number of individuals who get influenced at 
different times as input (e.g., the number of buyers per week). 


Modeling Influence in Explicit Networks 

Threshold models are simple yet effective methods for modeling influence
in explicit networks. In these models, nodes make decisions based on the
number or the fraction (the threshold) of their neighbors (or incoming
neighbors in a directed graph) who have already decided to make the same
decision. Threshold models were employed in the literature as early as the
1970s in the works of Granovetter [1983] and Schelling [1971]. Using a
threshold model, Schelling demonstrated that minor local preferences in
having neighbors of the same color lead to global racial segregation.

A linear threshold model (LTM) is an example of a threshold model.
Assume a weighted directed graph where nodes $v_j$ and $v_i$ are connected
with weight $w_{j,i} \geq 0$. This weight denotes how much node $v_j$ can affect
node $v_i$’s decision. We also assume

$$\sum_{v_j \in N_{in}(v_i)} w_{j,i} \leq 1, \tag{8.31}$$

where $N_{in}(v_i)$ denotes the incoming neighbors of node $v_i$. In a linear threshold
model, each node $v_i$ is assigned a threshold $\theta_i$ such that when the amount
of influence exerted toward $v_i$ by its active incoming neighbors is more than
$\theta_i$, then $v_i$ becomes active, if still inactive. Thus, for $v_i$ to become active at
time $t$, we should have

$$\sum_{v_j \in N_{in}(v_i),\, v_j \in A_{t-1}} w_{j,i} > \theta_i, \tag{8.32}$$

where $A_{t-1}$ denotes the set of active nodes at the end of time $t - 1$. The
threshold values are generally assigned uniformly at random to nodes from
the interval [0,1]. Note that the threshold $\theta_i$ defines how resistant to change
node $v_i$ is: a very small $\theta_i$ value might indicate that a small change in the
activity of $v_i$’s neighborhood results in $v_i$ becoming active and a large $\theta_i$
shows that $v_i$ resists changes.

Provided a set of initial active nodes $A_0$ and a graph, the LTM algorithm
is shown in Algorithm 8.1. In each step, for all inactive nodes, the condition
in Equation 8.32 is checked, and if it is satisfied, the node becomes active.
The process ends when no more nodes can be activated. Once $\theta$ thresholds
are fixed, the process is deterministic and will always converge to the same
state.

Algorithm 8.1 Linear Threshold Model (LTM)

Require: Graph $G(V, E)$, set of initial activated nodes $A_0$
return Final set of activated nodes
$i = 0$;
Uniformly assign random thresholds $\theta_v$ from the interval [0, 1];
while $i = 0$ or ($A_{i-1} \neq A_i$, $i \geq 1$) do
    $A_{i+1} = A_i$;
    inactive $= V - A_i$;
    for all $v \in$ inactive do
        if $\sum_{j \text{ connected to } v,\, j \in A_i} w_{j,v} > \theta_v$ then
            activate $v$;
            $A_{i+1} = A_{i+1} \cup \{v\}$;
        end if
    end for
    $i = i + 1$;
end while
$A_\infty = A_i$;
Return $A_\infty$;
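
The procedure in Algorithm 8.1 can be sketched in Python as follows. This is a minimal illustration, not the book's reference implementation: the function name `linear_threshold`, the nested-dictionary weight format, and the three-node graph are assumptions. Only the weights 0.5 and 0.8 and the thresholds 0.8 and 0.7 echo Example 8.3; the edge from `v3` to `v2` is invented so that the cascade continues for a second step.

```python
import random


def linear_threshold(weights, seeds, thresholds=None, rng=None):
    """Run the LTM; weights[j][i] is the influence w_{j,i} of node j on node i."""
    nodes = set(weights) | {i for targets in weights.values() for i in targets}
    nodes |= set(seeds)
    if thresholds is None:
        rng = rng or random.Random()
        thresholds = {v: rng.random() for v in nodes}  # uniform on [0, 1)
    active = set(seeds)
    while True:
        newly_active = set()
        for v in nodes - active:
            # Total weight arriving from neighbors that were active last step.
            incoming = sum(targets.get(v, 0.0)
                           for j, targets in weights.items() if j in active)
            if incoming > thresholds[v]:  # condition of Equation 8.32
                newly_active.add(v)
        if not newly_active:  # no change since the previous step: stop
            return active
        active |= newly_active


weights = {"v1": {"v2": 0.5, "v3": 0.8}, "v3": {"v2": 0.4}}
thresholds = {"v1": 0.1, "v2": 0.8, "v3": 0.7}
print(sorted(linear_threshold(weights, {"v1"}, thresholds)))  # ['v1', 'v2', 'v3']
```

Collecting newly activated nodes in a separate set keeps each step synchronous, so a node activated in the current step only starts influencing others in the next one.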

Example 8.3. Consider the graph in Figure 8.5. Values attached to nodes
represent the LTM thresholds, and edge values represent the weights. At time
0, node $v_1$ is activated. At time 1, both nodes $v_2$ and $v_3$ receive influence
from node $v_1$. Node $v_2$ is not activated since 0.5 < 0.8 and node $v_3$ is
activated since 0.8 > 0.7. Similarly, the process continues and then stops
with five activated nodes.


Modeling Influence in Implicit Networks 

An implicit network is one where the influence spreads over edges in
the network; however, unlike the explicit model, we cannot observe the
individuals (the influentials) who are responsible for influencing others, but
only those who get influenced. In other words, the information available
is the set of influenced population $P(t)$ at any time and the time $t_u$ when
each individual $u$ gets initially influenced (activated). We assume that any
influenced individual $u$ can influence $I(u, t)$ number of non-influenced
(inactive) individuals after $t$ time steps. We call $I(\cdot, \cdot)$ the influence function.
Figure 8.5. Linear Threshold Model (LTM) Simulation. The values attached to nodes
denote thresholds $\theta_i$, and the values on the edges represent weights $w_{i,j}$.


Assuming discrete time steps, we can formulate the size of the influenced
population $|P(t)|$:

$$|P(t)| = \sum_{u \in P(t)} I(u, t - t_u). \tag{8.33}$$

Figure 8.6. The Size of the Influenced Population as a Summation of Individuals
Influenced by Activated Individuals (from [Yang and Leskovec, 2010]).

Figure 8.6 shows how the model performs. Individuals $u$, $v$, and $w$ are
activated at time steps $t_u$, $t_v$, and $t_w$, respectively. At time $t$, the total number
of influenced individuals is a summation of influence functions $I_u$, $I_v$, and $I_w$
at time steps $t - t_u$, $t - t_v$, and $t - t_w$, respectively. Our goal is to estimate
$I(\cdot, \cdot)$ given activation times and the number of influenced individuals at all
times. A simple approach is to utilize a probability distribution to estimate
the $I$ function. For instance, we can employ the power-law distribution to
estimate influence. In this case, $I(u, t) = c_u (t - t_u)^{-\alpha_u}$, where we estimate
coefficients $c_u$ and $\alpha_u$ for any $u$ by methods such as maximum likelihood
estimation (see [Myung, 2003] for more details).
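
As an illustration of Equation 8.33 with the power-law form of the influence function, the sketch below sums the contributions of all individuals activated before time $t$. The function name, the coefficient values, and the choice to skip individuals with $t \leq t_u$ (where the power law is undefined or not yet applicable) are assumptions, not part of the original model description.

```python
def influenced_population(t, activation_times, c, alpha):
    """|P(t)| as the sum of power-law influences c_u * (t - t_u) ** (-alpha_u)."""
    total = 0.0
    for u, t_u in activation_times.items():
        if t > t_u:  # only individuals activated strictly before t contribute
            total += c[u] * (t - t_u) ** (-alpha[u])
    return total


# Hypothetical activation times and coefficients for individuals u, v, and w.
activation_times = {"u": 1, "v": 2, "w": 4}
c = {"u": 4.0, "v": 3.0, "w": 2.0}
alpha = {"u": 1.0, "v": 1.0, "w": 1.0}
print(round(influenced_population(5, activation_times, c, alpha), 6))  # 4.0
```

At $t = 5$ the three terms are 4/4, 3/3, and 2/1, so the total is 4; fitting $c_u$ and $\alpha_u$ from data is a separate estimation step that this sketch does not attempt.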

This is called parametric estimation, and the method assumes that
all users influence others in the same parametric form. A more flexible
approach is to assume a nonparametric function and estimate the influence
function’s form. This approach was first introduced as the linear influence
model (LIM) [Yang and Leskovec, 2010].

In LIM, we extend our formulation by assuming that nodes get deactivated
over time and then no longer influence others. Let $A(u, t) = 1$ denote
that node $u$ is active at time $t$, and $A(u, t) = 0$ denote that node $u$ is either
deactivated or still not influenced. Following a network notation and assuming
that $|V|$ is the total size of the population and $T$ is the last time step, we
can reformulate Equation 8.33 for $|P(t)|$ as


$$|P(t)| = \sum_{u=1}^{|V|} \sum_{t=1}^{T} A(u, t)\, I(u, t), \tag{8.34}$$


or equivalently in matrix form,

$$P = AI. \tag{8.35}$$


It is common to assume that individuals can only activate other individuals
and cannot stop others from becoming activated. Hence, negative
values for influence do not make sense; therefore, we would like measured
influence values to be nonnegative, $I \geq 0$:

$$\text{minimize} \quad \|P - AI\|_2^2 \tag{8.36}$$

$$\text{subject to} \quad I \geq 0. \tag{8.37}$$

This formulation is similar to regression coefficient computation outlined
in Chapter 5, where we compute a least square estimate of $I$; however, this
formulation cannot be solved using regression techniques studied earlier
because, in regression, computed $I$ values can become negative. In practice,
this formulation can be solved using non-negative least square methods (see
[Lawson and Hanson, 1995] for details).
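
To show what the constraint in Equation 8.37 changes, the sketch below solves a tiny instance of Equations 8.36 and 8.37 with projected gradient descent in plain Python. This is a substitute technique chosen for brevity, not the Lawson and Hanson active-set algorithm; the function name, step-size rule, and toy matrices are assumptions. In practice a library routine such as SciPy's `scipy.optimize.nnls` would normally be used instead.

```python
def nnls_projected_gradient(A, p, steps=500):
    """Approximately minimize ||p - A x||^2 subject to x >= 0."""
    rows, cols = len(A), len(A[0])
    # 1 / (squared Frobenius norm) is a safe step size for this objective.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * cols
    for _ in range(steps):
        residual = [sum(A[r][c] * x[c] for c in range(cols)) - p[r]
                    for r in range(rows)]
        gradient = [sum(A[r][c] * residual[r] for r in range(rows))
                    for c in range(cols)]
        # Gradient step followed by projection onto the nonnegative orthant.
        x = [max(0.0, x[c] - step * gradient[c]) for c in range(cols)]
    return x


# Toy system whose unconstrained least squares solution is (7/3, -2/3).
A = [[1, 0], [1, 1], [0, 1]]
P = [3, 1, 0]
influence = nnls_projected_gradient(A, P)
print([round(v, 3) for v in influence])  # [2.0, 0.0]
```

Ordinary least squares would assign the second coefficient a negative value here; the projection step clamps it to zero and the first coefficient adjusts to 2 accordingly.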

8.3 Homophily 

Homophily is the tendency of similar individuals to become friends. It 
happens on a daily basis in social media and is clearly observable in social 
networking sites where befriending can explicitly take place. The well- 
known saying, “birds of a feather flock together” is frequently quoted when 
discussing homophily. Unlike influence, where an influential influences 
others, in homophily, two similar individuals decide to get connected. 

8.3.1 Measuring Homophily 

Homophily is the linking of two individuals due to their similarity and leads
to assortative networks over time. To measure homophily, we measure how
the assortativity of the network has changed over time. Consider two
snapshots of a network $G_{t_1}(V, E_{t_1})$ and $G_{t_2}(V, E_{t_2})$ at times $t_1$ and $t_2$,
respectively, where $t_2 > t_1$. Without loss of generality, we assume that the
number of nodes is fixed and only edges connecting these nodes change
(i.e., are added or removed).

When dealing with nominal attributes, the homophily index is defined
as

$$H = Q^{t_2}_{normalized} - Q^{t_1}_{normalized}, \tag{8.38}$$

where $Q_{normalized}$ is defined in Equation 8.9. Similarly, for ordinal attributes,
the homophily index can be defined as the change in the Pearson correlation
(Equation 8.24):

$$H = \rho^{t_2} - \rho^{t_1}. \tag{8.39}$$
Algorithm 8.2 Homophily Model

Require: Graph $G(V, E)$, $E = \emptyset$, similarities $sim(v, u)$
return Set of edges $E$
for all $v \in V$ do
    $\theta_v$ = generate a random number in [0,1];
    for all $(v, u) \notin E$ do
        if $\theta_v < sim(v, u)$ then
            $E = E \cup (v, u)$;
        end if
    end for
end for
Return $E$;


8.3.2 Modeling Homophily 

Homophily can be modeled using a variation of the independent cascade
model discussed in Chapter 7. In this variation, at each time step a single
node gets activated, and the activated node gets a chance of getting connected
to other nodes due to homophily. In other words, if the activated
node finds other nodes in the network similar enough (i.e., their similarity
is higher than some tolerance value), it connects to them via an edge. A
node once activated has no chance of getting activated again.

Modeling homophily is outlined in Algorithm 8.2. Let $sim(u, v)$ denote
the similarity between nodes $u$ and $v$. When a node gets activated, we
generate a random tolerance value for the node $v$ between 0 and 1. Alternatively,
we can set this tolerance to some predefined value. The tolerance
value defines the minimum similarity that node $v$ tolerates for connecting
to other nodes. Then, for any likely edge $(v, u)$ that is still not in the edge
set, if the similarity is $sim(v, u) > \theta_v$, the edge $(v, u)$ is added. The process
continues until all nodes are activated.
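
A compact Python sketch of Algorithm 8.2 is given below. The function name `homophily_model`, the optional fixed `tolerance` argument, the age attribute, and the age-based similarity function are all assumptions introduced for the example; the text leaves the choice of similarity measure open.

```python
import random


def homophily_model(nodes, sim, tolerance=None, rng=None):
    """Activate each node once and link it to every sufficiently similar node."""
    rng = rng or random.Random()
    edges = set()
    for v in nodes:
        # Random tolerance in [0, 1) unless a predefined value is supplied.
        theta = rng.random() if tolerance is None else tolerance
        for u in nodes:
            edge = frozenset((v, u))
            if u != v and edge not in edges and sim(v, u) > theta:
                edges.add(edge)
    return edges


# Hypothetical ordinal attribute (age) and a similarity scaled to [0, 1].
age = {"a": 18, "b": 20, "c": 21, "d": 40}
span = max(age.values()) - min(age.values())


def age_similarity(u, v):
    return 1 - abs(age[u] - age[v]) / span


edges = homophily_model(sorted(age), age_similarity, tolerance=0.7)
print(sorted(sorted(e) for e in edges))  # [['a', 'b'], ['a', 'c'], ['b', 'c']]
```

With a fixed tolerance of 0.7 the three individuals of similar age become fully connected, while the 40-year-old stays isolated; with random tolerances the outcome varies from run to run.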

The model can be used in two different scenarios. First, given a network
in which assortativity is attributed to homophily, we can estimate
tolerance values for all nodes. To estimate tolerance values, we can simulate
the homophily model in Algorithm 8.2 on the given network with
different tolerance values and by removing edges. We can then compare the
assortativity of the simulated network and the given network. By finding
the simulated network that best fits the given network (i.e., has the closest
assortativity value to the given network’s assortativity), we can determine
the tolerance values for individuals. Second, when a network is given
and the source of assortativity is unknown, we can estimate how much
of the observed assortativity can be attributed to homophily. To measure
assortativity due to homophily, we can simulate homophily on the
given network by removing edges. The distance between the assortativity
measured on the simulated network and the given network explains how
much of the observed assortativity is due to homophily. The smaller this
distance, the higher the effect of homophily in generating the observed
assortativity.


8.4 Distinguishing Influence and Homophily 

We are often interested in understanding which social force (influence or 
homophily) resulted in an assortative network. To distinguish between an 
influence-based assortativity or homophily-based one, statistical tests can be 
used. In this section, we discuss three tests: the shuffle test, the edge-reversal 
test, and the randomization test. The first two can detect whether influence 
exists in a network or not, but are incapable of detecting homophily. The 
last one, however, can distinguish influence and homophily. Note that in all 
these tests, we assume that several temporal snapshots of the dataset are 
available (like the LIM model) where we know exactly when each node is 
activated, when edges are formed, or when attributes are changed. 


8.4.1 Shuffle Test 

The shuffle test was originally introduced by Anagnostopoulos et al. [2008].
The basic idea behind the shuffle test comes from the fact that influence
is temporal. In other words, when $u$ influences $v$, then $v$ should have been
activated after $u$. So, in the shuffle test, we define a temporal assortativity
measure. We assume that if there is no influence, then a shuffling
of the activation time stamps should not affect the temporal assortativity
measurement.

In this temporal assortativity measure, called social correlation, the
probability of activating a node $v$ depends on $a$, the number of already
active friends it has. This activation probability is calculated using a logistic
function,

$$p(a) = \frac{e^{\alpha a + \beta}}{1 + e^{\alpha a + \beta}}, \tag{8.40}$$

or equivalently,

$$\ln \left( \frac{p(a)}{1 - p(a)} \right) = \alpha a + \beta, \tag{8.41}$$


where $\alpha$ measures the social correlation and $\beta$ denotes the activation bias.
For computing the number of already active nodes of an individual, we need
to know the activation time stamps of the nodes.

Let $y_{a,t}$ denote the number of individuals who became activated at time $t$
and had $a$ active friends and let $n_{a,t}$ denote the ones who had $a$ active friends
but did not get activated at time $t$. Let $y_a = \sum_t y_{a,t}$ and $n_a = \sum_t n_{a,t}$. We
define the likelihood function as

$$\prod_a p(a)^{y_a} (1 - p(a))^{n_a}. \tag{8.42}$$


To estimate $\alpha$ and $\beta$, we find their values such that the likelihood function
denoted in Equation 8.42 is maximized. Unfortunately, there is no closed-form
solution, but there exist software packages that can efficiently compute
the solution to this optimization.

Let $t_u$ denote the activation time (when a node is first influenced) of node
$u$. When activated node $u$ influences nonactivated node $v$, and $v$ is activated,
then we have $t_u < t_v$. Hence, when temporal information is available about
who activated whom, we see that influenced nodes are activated at a later
time than those who influenced them. Now, if there is no influence in
the network, we can randomly shuffle the activation time stamps, and the
predicted $\alpha$ should not change drastically. So, if we shuffle activation time
stamps and compute the correlation coefficient $\alpha'$ and its value is close to
the $\alpha$ computed in the original unshuffled dataset (i.e., $|\alpha - \alpha'|$ is small),
then the network does not exhibit signs of social influence.
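
The steps of the shuffle test can be sketched in Python as follows: count $y_a$ and $n_a$ from activation time stamps, fit $\alpha$ and $\beta$ by maximizing the logarithm of Equation 8.42, then repeat after shuffling the time stamps. Everything here is an illustrative assumption: the function names, the convention that only not-yet-active nodes are counted at each step, and the use of plain gradient ascent in place of a statistics package.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function (Equation 8.40 with z = alpha*a + beta)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def activation_counts(friends, act_time, horizon):
    """Return (y, n): counts of inactive nodes with `a` active friends that
    did (y[a]) or did not (n[a]) become active, summed over steps 1..horizon."""
    y, n = {}, {}
    for t in range(1, horizon + 1):
        for v in friends:
            t_v = act_time.get(v, math.inf)
            if t_v < t:
                continue  # v was activated earlier and is no longer counted
            a = sum(1 for u in friends[v] if act_time.get(u, math.inf) < t)
            bucket = y if t_v == t else n
            bucket[a] = bucket.get(a, 0) + 1
    return y, n


def fit_social_correlation(y, n, steps=2000):
    """Gradient ascent on the log-likelihood; returns (alpha, beta)."""
    counts = {a: (y.get(a, 0), n.get(a, 0)) for a in set(y) | set(n)}
    # Upper bound on the curvature, used as the inverse step size.
    bound = 0.25 * sum((ya + na) * (a * a + 1) for a, (ya, na) in counts.items())
    if bound == 0:
        return 0.0, 0.0
    alpha = beta = 0.0
    for _ in range(steps):
        g_alpha = g_beta = 0.0
        for a, (ya, na) in counts.items():
            diff = ya - (ya + na) * sigmoid(alpha * a + beta)
            g_alpha += diff * a
            g_beta += diff
        alpha += g_alpha / bound
        beta += g_beta / bound
    return alpha, beta


def shuffle_test(friends, act_time, horizon, rng):
    """Return (alpha, alpha_shuffled) for original and shuffled time stamps."""
    alpha, _ = fit_social_correlation(*activation_counts(friends, act_time, horizon))
    times = list(act_time.values())
    rng.shuffle(times)
    shuffled = dict(zip(act_time, times))
    alpha_s, _ = fit_social_correlation(*activation_counts(friends, shuffled, horizon))
    return alpha, alpha_s
```

A typical call would be `shuffle_test(friends, act_time, horizon, random.Random(0))`, where `friends` maps each node to its neighbors and `act_time` maps activated nodes to their time stamps; a large gap between the two returned values would then be read as a sign of influence. When the data are perfectly separable the maximizing $\alpha$ is unbounded, so the fixed number of steps only yields a large finite value.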


8.4.2 Edge-Reversal Test 


The edge-reversal test introduced by Christakis and Fowler [2007] follows
an approach similar to that of the shuffle test. If influence
resulted in activation, then the direction of edges should be important (who
influenced whom). So, we can reverse the direction of edges, and if there
is no social influence in the network, then the value of social correlation $\alpha$,
as defined in Section 8.4.1, should not change dramatically.
Figure 8.7. The Effect of Influence and Homophily on Attributes and Links over Time
(reproduced from La Fond and Neville [2010]).


8.4.3 Randomization Test 

Unlike the other two tests, the randomization test [La Fond and Neville,
2010] is capable of detecting both influence and homophily in networks.
Let $X$ denote the attributes associated with nodes (age, gender, location,
etc.) and $X_t$ denote the attributes at time $t$. Let $X^i$ denote attributes of
node $v_i$. As mentioned before, in influence, individuals already linked to
one another change their attributes (e.g., a user changes habits), whereas
in homophily, attributes do not change but connections are formed due to
similarity. Figure 8.7 demonstrates the effect of influence and homophily
in a network over time.

The assumption is that, if influence or homophily happens in a network,
then networks become more assortative. Let $A(G_t, X_t)$ denote the assortativity
of network $G$ and attributes $X$ at time $t$. Then, the network becomes
more assortative at time $t + 1$ if

$$A(G_{t+1}, X_{t+1}) - A(G_t, X_t) > 0. \tag{8.43}$$

Now, we can assume that part of this assortativity is due to influence if
the influence gain $G_{Influence}$ is positive,

$$G_{Influence}(t) = A(G_t, X_{t+1}) - A(G_t, X_t) > 0, \tag{8.44}$$

and part is due to homophily if we have positive homophily gain $G_{Homophily}$:

$$G_{Homophily}(t) = A(G_{t+1}, X_t) - A(G_t, X_t) > 0. \tag{8.45}$$

Note that $X_{t+1}$ denotes the changes in attributes, and $G_{t+1}$ denotes the
changes in links in the network (new friendships formed). In randomization
tests, one determines whether changes in $A(G_t, X_{t+1}) - A(G_t, X_t)$
(influence), or $A(G_{t+1}, X_t) - A(G_t, X_t)$ (homophily), are significant or
not. To detect change significance, we use the influence significance test
and homophily significance test algorithms outlined in Algorithms 8.3
and 8.4, respectively. The influence significance algorithm starts with
computing influence gain, which is the assortativity difference observed
due to influence ($g_0$). It then forms a random attribute set at time $t + 1$
(the null hypothesis), assuming that attributes changed randomly at $t + 1$ and
not due to influence. This random attribute set $XR^i_{t+1}$ is formed from
$X_{t+1}$ by making sure that effects of influence in changing attributes are
removed.

Algorithm 8.3 Influence Significance Test

Require: $G_t$, $G_{t+1}$, $X_t$, $X_{t+1}$, number of randomized runs $n$, $\alpha$
return Significance
$g_0 = G_{Influence}(t)$;
for all $1 \le i \le n$ do
    $XR^i_{t+1} = randomize_I(X_t, X_{t+1})$;
    $g_i = A(G_t, XR^i_{t+1}) - A(G_t, X_t)$;
end for
if $g_0$ larger than $(1 - \alpha/2)\%$ of values in $\{g_i\}_{i=1}^{n}$ then
    return significant;
else if $g_0$ smaller than $\alpha/2\%$ of values in $\{g_i\}_{i=1}^{n}$ then
    return significant;
else
    return insignificant;
end if
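The influence significance test (Algorithm 8.3) can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not part of the text: assortativity is approximated by the fraction of edges whose endpoints share a nominal attribute value, the randomization step simply redraws every changed attribute from a list of possible values, and the percentile rule is read as a two-sided test with $\alpha$ given as a fraction. The function names are hypothetical.

```python
import random


def assortativity(edges, attrs):
    """Fraction of edges whose endpoints share an attribute value.

    A deliberately simple stand-in for the assortativity measure A(G, X).
    """
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if attrs[u] == attrs[v])
    return same / len(edges)


def randomize_attributes(x_t, x_t1, values, rng):
    """Copy x_t1, redrawing at random every attribute that changed since x_t."""
    randomized = dict(x_t1)
    for node, old_value in x_t.items():
        if x_t1[node] != old_value:
            randomized[node] = rng.choice(values)
    return randomized


def influence_significance(edges_t, x_t, x_t1, values, n=1000, alpha=0.05, seed=0):
    """Return (g0, significant) for the observed influence gain g0."""
    rng = random.Random(seed)
    baseline = assortativity(edges_t, x_t)
    g0 = assortativity(edges_t, x_t1) - baseline
    gains = [
        assortativity(edges_t, randomize_attributes(x_t, x_t1, values, rng)) - baseline
        for _ in range(n)
    ]
    # Two-sided reading of the test: g0 must fall in the upper or lower tail.
    frac_below = sum(g < g0 for g in gains) / n
    frac_above = sum(g > g0 for g in gains) / n
    threshold = 1 - alpha / 2
    return g0, (frac_below >= threshold or frac_above >= threshold)


# Toy network: a path of four users who all adopt attribute "a" at time t + 1.
edges = [(0, 1), (1, 2), (2, 3)]
x_t = {0: "a", 1: "b", 2: "c", 3: "d"}
x_t1 = {0: "a", 1: "a", 2: "a", 3: "a"}
g0, significant = influence_significance(
    edges, x_t, x_t1, values=list("abcdefghij"), n=200
)
```

A real study would replace `assortativity` with the measure appropriate to the attribute type (for example, modularity for nominal attributes) and would choose the randomization so that it matches the null hypothesis being tested.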

For instance, assume two users $u$ and $v$ are connected at time $t$, and
$u$ has hobby movies at time $t$ and $v$ does not have this hobby listed at
time $t$. Now, assume there is an influence of $u$ over $v$, so that at time
$t + 1$, $v$ adds movies to her set of hobbies. In other words, movies $\notin X^v_t$
and movies $\in X^v_{t+1}$. To remove this influence, we can construct $XR^i_{t+1}$ by
removing movies from the hobbies of $v$ at time $t + 1$ and adding some
random hobby such as reading, which is $\notin X^u_t$ and $\notin X^v_t$, to the list of
hobbies of $u$ at time $t + 1$ in $XR^i_{t+1}$. This guarantees that the randomized
$XR^i_{t+1}$ constructed has no sign of influence. We construct this randomized
set $n$ times; this set is then used to compute influence gains $\{g_i\}_{i=1}^{n}$.
Obviously, the more distant $g_0$ is from these gains, the more significant
influence is. We can assume that whenever $g_0$ is smaller than $\alpha/2\%$ (or larger
than $(1 - \alpha/2)\%$) of $\{g_i\}_{i=1}^{n}$ values, it is significant. The value of $\alpha$ is set
empirically.

Algorithm 8.4 Homophily Significance Test

Require: $G_t$, $G_{t+1}$, $X_t$, $X_{t+1}$, number of randomized runs $n$, $\alpha$
return Significance
$g_0 = G_{Homophily}(t)$;
for all $1 \le i \le n$ do
    $GR^i_{t+1} = randomize_H(G_t, G_{t+1})$;
    $g_i = A(GR^i_{t+1}, X_t) - A(G_t, X_t)$;
end for
if $g_0$ larger than $(1 - \alpha/2)\%$ of values in $\{g_i\}_{i=1}^{n}$ then
    return significant;
else if $g_0$ smaller than $\alpha/2\%$ of values in $\{g_i\}_{i=1}^{n}$ then
    return significant;
else
    return insignificant;
end if

Similarly, in the homophily significance test, we compute the original
homophily gain and construct random graph links $GR^i_{t+1}$ at time $t + 1$,
such that no homophily effect is exhibited in how links are formed. To
perform this, for any two (randomly selected) links $e_{ij}$ and $e_{kl}$ formed in the
original $G_{t+1}$ graph, we form edges $e_{il}$ and $e_{kj}$ in $GR^i_{t+1}$. This is to make
sure that the homophily effect is removed and that the degrees in $GR^i_{t+1}$ are
equal to that of $G_{t+1}$.
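The link randomization used by the homophily significance test can be sketched as a degree-preserving edge swap. The sketch below is an illustration only: the function names are hypothetical, edges are stored as ordered pairs, and, for brevity, it does not reject swaps that create self-loops or duplicate links, which a careful implementation would probably want to do.

```python
import random
from collections import Counter


def rewire_new_links(edges_t, edges_t1, swaps=100, seed=0):
    """Return a randomized copy of G_{t+1}: old links kept, new links rewired."""
    rng = random.Random(seed)
    old = set(edges_t)
    new = [e for e in edges_t1 if e not in old]  # links formed after time t
    for _ in range(swaps):
        if len(new) < 2:
            break
        a, b = rng.sample(range(len(new)), 2)
        (i, j), (k, l) = new[a], new[b]
        new[a], new[b] = (i, l), (k, j)  # swap the endpoints of the two links
    return list(edges_t) + new


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


g_t = [(0, 1)]
g_t1 = [(0, 1), (2, 3), (4, 5), (6, 7)]
g_random = rewire_new_links(g_t, g_t1, swaps=10)
```

Because each swap replaces two links with two links on the same four endpoints, every node keeps its degree, which is the property the text requires of the randomized graph.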


8.5 Summary 

Individuals are driven by different social forces across social media. Two 
such important forces are influence and homophily. 

In influence, an individual’s actions induce her friends to act in a similar 
fashion. In other words, influence makes friends more similar. Homophily 
is the tendency for similar individuals to befriend each other. Both influence 
and homophily result in networks where similar individuals are connected 
to each other. These are assortative networks. To estimate the assortativity 
of networks, we use different measures depending on the attribute type that 
is tested for similarity. We discussed modularity for nominal attributes and 
correlation for ordinal ones.

Influence can be quantified via different measures. Some are prediction-based,
where the measure assumes that some attributes can accurately
predict how influential an individual will be, such as with in-degree. Others
are observation-based, where the influence score is assigned to an individual
based on some history, such as how many individuals he or she has
influenced. We also presented case studies for measuring influence (1) in
the blogosphere and (2) on Twitter.

Influence is modeled differently depending on the visibility of the network.
When network information is available, we employ threshold models
such as the linear threshold model (LTM), and when network information is
not available, we estimate influence rates using the linear influence model
(LIM). Similarly, homophily can be measured by computing the assortativity
difference in time and modeled using a variant of independent cascade
models.

Finally, to determine the source of assortativity in social networks, we 
described three statistical tests: the shuffle test, the edge-reversal test, and 
the randomization test. The first two can determine if influence is present in 
the data, and the last one can determine both influence and homophily. All 
tests require temporal data, where activation times and changes in attributes 
and links are available. 


8.6 Bibliographic Notes 

Indications of assortativity observed in the real world can be found in
Currarini et al. [2009]. General reviews of the assortativity measuring
methods discussed in this chapter can be found in [Newman, 2002a, 2010;
Newman and Girvan, 2003].

Influence and homophily are extensively discussed in the social sciences
literature (see [Cialdini and Trost, 1998; McPherson et al., 2001]). Interesting
experiments in this area can be found in Milgram’s seminal experiment
on obedience to authority [Milgram, 2009]. In his controversial study,
Milgram showed that many individuals, because of fear or their desire to appear
cooperative, are willing to perform acts that are against their better judgment.
He recruited participants in what seemingly looked like a learning
experiment. Participants were told to administer increasingly severe electric
shocks to another individual (“the learner”) if he answered questions
incorrectly. These shocks ranged from 15 to 450 volts (lethal level). In reality,
the learner was an actor, a confederate of Milgram, and never received any
shocks. However, the actor shouted loudly to demonstrate the painfulness
of the shocks. Milgram found that 65% of participants in his experiments
were willing to give lethal electric shocks up to 450 volts to the learner,
after being given assurance statements such as “Although the shocks may
be painful, there is no permanent tissue damage, so please go on,” or given
direct orders, such as “the experiment requires that you continue.” Another
study is the 32-year longitudinal study on the spread of obesity in social
networks [Christakis and Fowler, 2007]. In this study, Christakis and Fowler
analyzed a population of 12,067 individuals. The body mass index for these
individuals was available from 1971 to 2003. They showed that an individual’s
likelihood of becoming obese over time increased by almost 60% if
he or she had an obese friend. This likelihood decreased to around 40% for
those with an obese sibling or spouse.

The analysis of influence and homophily is also an active topic in social 
media mining. For studies regarding influence and homophily online, refer 
to [Watts and Dodds, 2007; Shalizi and Thomas, 2010; Currarini et al., 
2009; Onnela and Reed-Tsochas, 2010; Weng et al., 2010; Bakshy et al., 
2001]. The effect of influence and homophily on the social network has also 
been used for prediction purposes. For instance, Tang et al. [2013a] use the 
effect of homophily for trust prediction. 

Modeling influence is challenging. For a review of threshold models, 
similar techniques, and challenges, see [Goyal et al., 2010; Watts, 2002; 
Granovetter, 1976; Kempe et al., 2003]. 

In addition to tests discussed for identifying influence or homophily, we
refer readers to the works of Aral et al. [2009] and Snijders et al. [2006].


8.7 Exercises 

1. State two common factors that explain why connected people are
similar or vice versa.

Measuring Assortativity

2. • What is the range $[\alpha_1, \alpha_2]$ for modularity $Q$ values? Provide
examples for both extreme values of the range, as well as cases where
modularity becomes zero.

• What are the limitations for modularity?

• Compute modularity in the following graph. Assume that $\{a_i\}$
nodes are category $a$, $\{b_i\}$ nodes are category $b$, and $\{c_i\}$
nodes are category $c$.

Influence

3. Does the linear threshold model (LTM) converge? Why? 

4. Follow the LTM procedure until convergence for the following graph. 
Assume all the thresholds are 0.5 and node o ,■ is activated at time 0. 




5. Discuss a methodology for identifying the influential given multiple
influence measures using the following scenario: on Twitter, one
can use in-degree and number of retweets as two independent influence
measures. How can you find the influential by employing both
measures?

Homophily

6. Design a measure for homophily that takes into account assortativity 
changes due to influence. 

Distinguishing Influence and Homophily 

7. What is a shuffle test designed for in the context of social influence? 
Describe how it is performed. 

8. Describe how the edge-reversal test works. What is it used for? 



9 

Recommendation in Social Media 


Individuals in social media make a variety of decisions on a daily basis. 
These decisions are about buying a product, purchasing a service, adding a 
friend, and renting a movie, among others. The individual often faces many 
options to choose from. These diverse options, the pursuit of optimality, and 
the limited knowledge that each individual has create a desire for external 
help. At times, we resort to search engines for recommendations; however, 
the results in search engines are rarely tailored to our particular tastes and 
are query-dependent, independent of the individuals who search for them. 

To ease this process, applications and algorithms are developed to help 
individuals decide easily, rapidly, and more accurately. These algorithms are 
tailored to individuals’ tastes such that customized recommendations are 
available for them. These algorithms are called recommendation algorithms 
or recommender systems. 

Recommender systems are commonly used for product recommendation.
Their goal is to recommend products that would be interesting to
individuals. Formally, a recommendation algorithm takes a set of users $U$
and a set of items $I$ and learns a function $f$ such that

$$f : U \times I \rightarrow \mathbb{R}. \qquad (9.1)$$

In other words, the algorithm learns a function that assigns a real value to
each user-item pair $(u, i)$, where this value indicates how interested user $u$
is in item $i$. This value denotes the rating given by user $u$ to item $i$. The
recommendation algorithm is not limited to item recommendation and can be
generalized to recommending people and material, such as ads or content.


Recommendation vs. Search

When individuals seek recommendations, they often use web search
engines. However, search engines are rarely tailored to individuals’ needs
and often retrieve the same results as long as the search query stays the
same. To receive accurate recommendations from a search engine, one
needs to send accurate keywords to the search engine. For instance, the
query “best 2013 movie to watch” issued by an 8-year-old and
an adult will result in the same set of movies, whereas their individual tastes
dictate different movies.

Recommendation systems are designed to recommend individual-based
choices. Thus, the same query issued by different individuals should result
in different recommendations. These systems commonly employ browsing
history, product purchases, user profile information, and friends information
to make customized recommendations. As simple as this process may
look, a recommendation system algorithm actually has to deal with many
challenges.


9.1 Challenges 

Recommendation systems face many challenges, some of which are presented
next:

• Cold-Start Problem. Many recommendation systems use historical
data or information provided by the user to recommend items, products,
and the like. However, when individuals first join sites, they have
not yet bought any product: they have no history. This makes it hard
to infer what they are going to like when they start on a site. The
problem is referred to as the cold-start problem. As an example, consider
an online movie rental store. This store has no idea what recently
joined users prefer to watch and therefore cannot recommend something
close to their tastes. To address this issue, these sites often ask
users to rate a couple of movies before they begin recommending others
to them. Other sites ask users to fill in profile information, such as
interests. This information serves as an input to the recommendation
algorithm.

• Data Sparsity. Similar to the cold-start problem, data sparsity occurs
when not enough historical or prior information is available. Unlike
the cold-start problem, data sparsity relates to the system as a whole
and is not specific to an individual. In general, data sparsity occurs
when a few individuals rate many items while many other individuals
rate only a few items. Recommender systems often use information
provided by other users to help offer better recommendations to an
individual. When this information is not reasonably available, then it
is said that a data sparsity problem exists. The problem is more prominent
in sites that are recently launched or ones that are not popular.

• Attacks. The recommender system may be attacked to recommend
items otherwise not recommended. For instance, consider a system
that recommends items based on similarity between ratings (e.g., lens
A is recommended for camera B because they both have rating 4).
Now, an attacker that has knowledge of the recommendation algorithm
can create a set of fake user accounts and rate lens C (which is not
as good as lens A) highly such that it can get rating 4. This way
the recommendation system will recommend C with camera B as
well as A. This attack is called a push attack, because it pushes the
ratings up such that the system starts recommending items that would
otherwise not be recommended. Other attacks such as nuke attacks
attempt to stop the whole recommendation system algorithm and
make it unstable. A recommendation system should have the means
to stop such attacks.

• Privacy. The more information a recommender system has about 
the users, the better the recommendations it provides to the users. 
However, users often avoid revealing information about themselves 
due to privacy concerns. Recommender systems should address this 
challenge while protecting individuals’ privacy. 

• Explanation. Recommendation systems often recommend items
without having an explanation why they did so. For instance, when
several items are bought together by many users, the system recommends
these items together to new users. However, the system does
not know why these items are bought together. Individuals may prefer
some reasons for buying items; therefore, recommendation algorithms
should provide explanation when possible.

9.2 Classical Recommendation Algorithms 

Classical recommendation algorithms have a long history on the web. In 
recent years, with the emergence of social media sites, these algorithms 
have been provided new information, such as friendship information, inter¬ 
actions, and so on. We review these algorithms in this section. 

9.2.1 Content-Based Methods 

Content-based recommendation systems are based on the fact that a user’s
interest should match the description of the items that are recommended by
the system. In other words, the more similar the item’s description to the
user’s interest, the higher the likelihood that the user is going to find the
item’s recommendation interesting. Content-based recommender systems
implement this idea by measuring the similarity between an item’s description
and the user’s profile information. The higher this similarity, the higher
the chance that the item is recommended.

Algorithm 9.1 Content-based recommendation

Require: User $i$’s Profile Information, Item descriptions for items $j \in \{1, 2, \ldots, n\}$, $k$ keywords, $r$ number of recommendations.
return $r$ recommended items.
$U_i = (u_1, u_2, \ldots, u_k)$ = user $i$’s profile vector;
$\{I_j\}_{j=1}^{n} = \{(i_{j,1}, i_{j,2}, \ldots, i_{j,k})$ = item $j$’s description vector$\}_{j=1}^{n}$;
$s_{i,j} = sim(U_i, I_j)$, $1 \le j \le n$;
Return top $r$ items with maximum similarity $s_{i,j}$.

To formalize a content-based method, we first represent both user profiles
and item descriptions by vectorizing (see Chapter 5) them using a set of $k$
keywords. After vectorization, item $j$ can be represented as a $k$-dimensional
vector $I_j = (i_{j,1}, i_{j,2}, \ldots, i_{j,k})$ and user $i$ as $U_i = (u_{i,1}, u_{i,2}, \ldots, u_{i,k})$. To
compute the similarity between user $i$ and item $j$, we can use cosine similarity
between the two vectors $U_i$ and $I_j$:

$$sim(U_i, I_j) = \cos(U_i, I_j) = \frac{\sum_{l=1}^{k} u_{i,l} \, i_{j,l}}{\sqrt{\sum_{l=1}^{k} u_{i,l}^2} \, \sqrt{\sum_{l=1}^{k} i_{j,l}^2}}. \qquad (9.2)$$

In content-based recommendation, we compute the topmost similar items
to a user and then recommend these items in the order of similarity.
Algorithm 9.1 shows the main steps of content-based recommendation.
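The steps of content-based recommendation can be sketched in a few lines of Python. The keyword vectors and item names below are invented for illustration, and the helper names `cosine` and `recommend` are hypothetical; only the use of cosine similarity and the top-$r$ selection come from the text.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length keyword vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def recommend(user_vector, item_vectors, r):
    """Return the names of the r items most similar to the user profile."""
    ranked = sorted(
        item_vectors.items(),
        key=lambda pair: cosine(user_vector, pair[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:r]]


# Hypothetical keywords: (action, comedy, romance)
user = [1, 0, 1]
items = {
    "movie_a": [1, 0, 0],
    "movie_b": [0, 1, 0],
    "movie_c": [1, 0, 1],
}
top = recommend(user, items, r=2)
```

Here `movie_c` matches the profile exactly, `movie_a` shares one keyword, and `movie_b` shares none, so the first two are returned in that order.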


9.2.2 Collaborative Filtering (CF) 


Collaborative filtering is another set of classical recommendation techniques.
In collaborative filtering, one is commonly given a user-item matrix
where each entry is either unknown or is the rating assigned by the user to
an item. Table 9.1 is a user-item matrix where ratings for some cartoons
are known and unknown for others (question marks). For instance, on a
review scale of 5, where 5 is the best and 0 is the worst, if an entry $(i, j)$ in
the user-item matrix is 4, that means that user $i$ liked item $j$.

In collaborative filtering, one aims to predict the missing ratings and
possibly recommend the cartoon with the highest predicted rating to the
user. This prediction can be performed directly by using previous ratings
in the matrix. This approach is called memory-based collaborative filtering
because it employs historical data available in the matrix. Alternatively, one
can assume that an underlying model (hypothesis) governs the way users
rate items. This model can be approximated and learned. After the model
is learned, one can use it to predict other ratings. The second approach is
called model-based collaborative filtering.

Table 9.1. User-Item Matrix

         Lion King   Aladdin   Mulan   Anastasia
John         3          0        3         3
Joe          5          4        0         2
Jill         1          2        4         2
Jane         3          ?        1         0
Jorge        2          2        0         1


Memory-Based Collaborative Filtering 

In memory-based collaborative filtering, one assumes one of the following 
(or both) to be true: 

• Users with similar previous ratings for items are likely to rate future 
items similarly. 

• Items that have received similar ratings previously from users are 
likely to receive similar ratings from future users. 


If one follows the first assumption, the memory-based technique is a
user-based CF algorithm, and if one follows the latter, it is an item-based
CF algorithm. In both cases, users (or items) collaboratively help filter
out irrelevant content (dissimilar users or items). To determine similarity
between users or items, in collaborative filtering, two commonly used similarity
measures are cosine similarity and Pearson correlation. Let $r_{u,i}$ denote
the rating that user $u$ assigns to item $i$, let $\bar{r}_u$ denote the average rating for
user $u$, and let $\bar{r}_i$ be the average rating for item $i$. Cosine similarity between
users $u$ and $v$ is

$$sim(U_u, U_v) = \cos(U_u, U_v) = \frac{U_u \cdot U_v}{\|U_u\| \, \|U_v\|} = \frac{\sum_i r_{u,i} \, r_{v,i}}{\sqrt{\sum_i r_{u,i}^2} \, \sqrt{\sum_i r_{v,i}^2}}. \qquad (9.3)$$

And the Pearson correlation coefficient is defined as

$$sim(U_u, U_v) = \frac{\sum_i (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_i (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_i (r_{v,i} - \bar{r}_v)^2}}. \qquad (9.4)$$

Next, we discuss user- and item-based collaborative filtering.

User-Based Collaborative Filtering. In this method, we predict the rating
of user $u$ for item $i$ by (1) finding users most similar to $u$ and (2) using a
combination of the ratings of these users for item $i$ as the predicted rating of
user $u$ for item $i$. To remove noise and reduce computation, we often limit
the number of similar users to some fixed number. These most similar users
are called the neighborhood for user $u$, $N(u)$. In user-based collaborative
filtering, the rating of user $u$ for item $i$ is calculated as

$$r_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} sim(u, v)(r_{v,i} - \bar{r}_v)}{\sum_{v \in N(u)} sim(u, v)}, \qquad (9.5)$$

where the number of members of $N(u)$ is predetermined (e.g., top 10 most
similar members).


Example 9.1. In Table 9.1, $r_{Jane,Aladdin}$ is missing. The average ratings are
the following:

$$\bar{r}_{John} = \frac{3 + 3 + 0 + 3}{4} = 2.25 \qquad (9.6)$$

$$\bar{r}_{Joe} = \frac{5 + 4 + 0 + 2}{4} = 2.75 \qquad (9.7)$$

$$\bar{r}_{Jill} = \frac{1 + 2 + 4 + 2}{4} = 2.25 \qquad (9.8)$$

$$\bar{r}_{Jane} = \frac{3 + 1 + 0}{3} = 1.33 \qquad (9.9)$$

$$\bar{r}_{Jorge} = \frac{2 + 2 + 0 + 1}{4} = 1.25 \qquad (9.10)$$

Using cosine similarity (or Pearson correlation), the similarity between
Jane and others can be computed:

$$sim(Jane, John) = \frac{3 \times 3 + 1 \times 3 + 0 \times 3}{\sqrt{10} \sqrt{27}} = 0.73 \qquad (9.11)$$

$$sim(Jane, Joe) = \frac{3 \times 5 + 1 \times 0 + 0 \times 2}{\sqrt{10} \sqrt{29}} = 0.88 \qquad (9.12)$$

$$sim(Jane, Jill) = \frac{3 \times 1 + 1 \times 4 + 0 \times 2}{\sqrt{10} \sqrt{21}} = 0.48 \qquad (9.13)$$

$$sim(Jane, Jorge) = \frac{3 \times 2 + 1 \times 0 + 0 \times 1}{\sqrt{10} \sqrt{5}} = 0.84. \qquad (9.14)$$

Now, assuming that the neighborhood size is 2, then Jorge and Joe are
the two most similar neighbors. Then, Jane’s rating for Aladdin computed
from user-based collaborative filtering is

$$\begin{aligned}
r_{Jane,Aladdin} &= \bar{r}_{Jane} + \frac{sim(Jane, Joe)(r_{Joe,Aladdin} - \bar{r}_{Joe})}{sim(Jane, Joe) + sim(Jane, Jorge)} + \frac{sim(Jane, Jorge)(r_{Jorge,Aladdin} - \bar{r}_{Jorge})}{sim(Jane, Joe) + sim(Jane, Jorge)} \\
&= 1.33 + \frac{0.88(4 - 2.75) + 0.84(2 - 1.25)}{0.88 + 0.84} = 2.33. \qquad (9.15)
\end{aligned}$$
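The computation in Example 9.1 can be reproduced with a short Python sketch. The data are the ratings of Table 9.1; the function names and the choice to compute cosine similarity only over items that both users have rated are assumptions made here to match the numbers in the example.

```python
from math import sqrt

ITEMS = ["Lion King", "Aladdin", "Mulan", "Anastasia"]
RATINGS = {  # user -> ratings in ITEMS order (Table 9.1); None marks an unknown rating
    "John": [3, 0, 3, 3],
    "Joe": [5, 4, 0, 2],
    "Jill": [1, 2, 4, 2],
    "Jane": [3, None, 1, 0],
    "Jorge": [2, 2, 0, 1],
}


def mean_rating(user):
    """Average of the ratings a user has actually given."""
    known = [r for r in RATINGS[user] if r is not None]
    return sum(known) / len(known)


def cosine(u, v):
    """Cosine similarity over the items that both users have rated."""
    pairs = [
        (a, b)
        for a, b in zip(RATINGS[u], RATINGS[v])
        if a is not None and b is not None
    ]
    dot = sum(a * b for a, b in pairs)
    norm_u = sqrt(sum(a * a for a, _ in pairs))
    norm_v = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_user_based(user, item, k=2):
    """Predict a rating from the k most similar users who rated the item."""
    col = ITEMS.index(item)
    candidates = [v for v in RATINGS if v != user and RATINGS[v][col] is not None]
    neighbors = sorted(candidates, key=lambda v: cosine(user, v), reverse=True)[:k]
    num = sum(cosine(user, v) * (RATINGS[v][col] - mean_rating(v)) for v in neighbors)
    den = sum(cosine(user, v) for v in neighbors)
    return mean_rating(user) + (num / den if den else 0.0)


prediction = predict_user_based("Jane", "Aladdin")
```

The result should be close to the 2.33 obtained in Example 9.1; small differences come from the example rounding the averages and similarities before combining them.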


Item-Based Collaborative Filtering. In user-based collaborative filtering,
we compute the average rating for different users and find the most similar
users to the users for whom we are seeking recommendations. Unfortunately,
in most online systems, users do not have many ratings; therefore,
the averages and similarities may be unreliable. This often results in a
different set of similar users when new ratings are added to the system. On
the other hand, products usually have many ratings and their average and
the similarity between them are more stable. In item-based CF, we perform
collaborative filtering by finding the most similar items. The rating of user
$u$ for item $i$ is calculated as

$$r_{u,i} = \bar{r}_i + \frac{\sum_{j \in N(i)} sim(i, j)(r_{u,j} - \bar{r}_j)}{\sum_{j \in N(i)} sim(i, j)}, \qquad (9.16)$$

where $\bar{r}_i$ and $\bar{r}_j$ are the average ratings for items $i$ and $j$, respectively.


Example 9.2. In Table 9.1, $r_{Jane,Aladdin}$ is missing. The average ratings for
items are

$$\bar{r}_{Lion\ King} = \frac{3 + 5 + 1 + 3 + 2}{5} = 2.8. \qquad (9.17)$$

$$\bar{r}_{Aladdin} = \frac{0 + 4 + 2 + 2}{4} = 2. \qquad (9.18)$$

$$\bar{r}_{Mulan} = \frac{3 + 0 + 4 + 1 + 0}{5} = 1.6. \qquad (9.19)$$

$$\bar{r}_{Anastasia} = \frac{3 + 2 + 2 + 0 + 1}{5} = 1.6. \qquad (9.20)$$

Using cosine similarity (or Pearson correlation), the similarity between
Aladdin and others can be computed:

$$sim(Aladdin, Lion\ King) = \frac{0 \times 3 + 4 \times 5 + 2 \times 1 + 2 \times 2}{\sqrt{24} \sqrt{39}} = 0.84. \qquad (9.21)$$

$$sim(Aladdin, Mulan) = \frac{0 \times 3 + 4 \times 0 + 2 \times 4 + 2 \times 0}{\sqrt{24} \sqrt{25}} = 0.32. \qquad (9.22)$$

$$sim(Aladdin, Anastasia) = \frac{0 \times 3 + 4 \times 2 + 2 \times 2 + 2 \times 1}{\sqrt{24} \sqrt{18}} = 0.67. \qquad (9.23)$$

Now, assuming that the neighborhood size is 2, then Lion King and Anastasia
are the two most similar neighbors. Then, Jane’s rating for Aladdin
computed from item-based collaborative filtering is

$$\begin{aligned}
r_{Jane,Aladdin} &= \bar{r}_{Aladdin} + \frac{sim(Aladdin, Lion\ King)(r_{Jane,Lion\ King} - \bar{r}_{Lion\ King})}{sim(Aladdin, Lion\ King) + sim(Aladdin, Anastasia)} \\
&\quad + \frac{sim(Aladdin, Anastasia)(r_{Jane,Anastasia} - \bar{r}_{Anastasia})}{sim(Aladdin, Lion\ King) + sim(Aladdin, Anastasia)} \\
&= 2 + \frac{0.84(3 - 2.8) + 0.67(0 - 1.6)}{0.84 + 0.67} = 1.40. \qquad (9.24)
\end{aligned}$$
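Example 9.2 can be checked in the same way with a small item-based sketch. As before, the ratings come from Table 9.1, while the function names and the decision to compare two items only over users who rated both are assumptions chosen to match the example.

```python
from math import sqrt

USERS = ["John", "Joe", "Jill", "Jane", "Jorge"]
RATINGS = {  # item -> ratings in USERS order (Table 9.1); None marks an unknown rating
    "Lion King": [3, 5, 1, 3, 2],
    "Aladdin": [0, 4, 2, None, 2],
    "Mulan": [3, 0, 4, 1, 0],
    "Anastasia": [3, 2, 2, 0, 1],
}


def mean_rating(item):
    """Average of the known ratings an item has received."""
    known = [r for r in RATINGS[item] if r is not None]
    return sum(known) / len(known)


def cosine(i, j):
    """Cosine similarity over the users who rated both items."""
    pairs = [
        (a, b)
        for a, b in zip(RATINGS[i], RATINGS[j])
        if a is not None and b is not None
    ]
    dot = sum(a * b for a, b in pairs)
    norm_i = sqrt(sum(a * a for a, _ in pairs))
    norm_j = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0


def predict_item_based(user, item, k=2):
    """Predict a rating from the k most similar items the user has rated."""
    row = USERS.index(user)
    candidates = [j for j in RATINGS if j != item and RATINGS[j][row] is not None]
    neighbors = sorted(candidates, key=lambda j: cosine(item, j), reverse=True)[:k]
    num = sum(cosine(item, j) * (RATINGS[j][row] - mean_rating(j)) for j in neighbors)
    den = sum(cosine(item, j) for j in neighbors)
    return mean_rating(item) + (num / den if den else 0.0)


prediction = predict_item_based("Jane", "Aladdin")
```

The result should agree with the 1.40 of Example 9.2 up to the rounding of intermediate values.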


Model-Based Collaborative Filtering 

In memory-based methods (either item-based or user-based), one aims to 
predict the missing ratings based on similarities between users or items. In 
model-based collaborative filtering, one assumes that an underlying model 
governs the way users rate. We aim to learn that model and then use that 
model to predict the missing ratings. Among a variety of model-based
techniques, we focus on a well-established model-based technique that is
based on singular value decomposition (SVD).

SVD is a linear algebra technique that, given a real matrix $X \in \mathbb{R}^{m \times n}$,
factorizes it into three matrices,

$$X = U \Sigma V^T, \qquad (9.25)$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{m \times n}$ is
a diagonal matrix. The product of these matrices is equivalent to the original
matrix; therefore, no information is lost. Hence, the process is lossless.

Let $\|X\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}^2}$ denote the Frobenius norm of matrix
$X$. A low-rank matrix approximation of matrix $X$ is another matrix
$C \in \mathbb{R}^{m \times n}$. $C$ approximates $X$, and $C$’s rank (the maximum number of
linearly independent columns) is a fixed number $k \ll \min(m, n)$:

$$rank(C) = k. \qquad (9.26)$$

The best low-rank matrix approximation is a matrix $C$ that minimizes
$\|X - C\|_F$. Low-rank approximations of matrices remove noise by assuming
that the matrix is not generated at random and has an underlying structure.
SVD can help remove noise by computing a low-rank approximation
of a matrix. Consider the following matrix $X_k$, which we construct from
matrix $X$ after computing the SVD of $X = U \Sigma V^T$:

1. Create $\Sigma_k$ from $\Sigma$ by keeping only the first $k$ elements on the diagonal.
This way, $\Sigma_k \in \mathbb{R}^{k \times k}$.

2. Keep only the first $k$ columns of $U$ and denote it as $U_k \in \mathbb{R}^{m \times k}$, and
keep only the first $k$ rows of $V^T$ and denote it as $V_k^T \in \mathbb{R}^{k \times n}$.

3. Let $X_k = U_k \Sigma_k V_k^T$, $X_k \in \mathbb{R}^{m \times n}$.

As it turns out, $X_k$ is the best low-rank approximation of a matrix $X$.
The following Eckart-Young-Mirsky theorem outlines this result.

Theorem 9.1 (Eckart-Young-Mirsky Low-Rank Matrix Approximation).
Let $X$ be a matrix and $C$ be the best low-rank approximation of $X$; if
$\|X - C\|_F$ is minimized, and $rank(C) = k$, then $C = X_k$.

To summarize, the best rank-$k$ approximation of the matrix can be easily
computed by calculating the SVD of the matrix and then taking the first $k$
columns of $U$, truncating $\Sigma$ to the first $k$ entries, and taking the first $k$
rows of $V^T$.
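The truncation procedure can be sketched with NumPy, assuming that library is available. The matrix below is the user-item matrix of Table 9.2; the helper name `low_rank` is hypothetical. With `full_matrices=False`, `numpy.linalg.svd` returns the reduced factors, which is sufficient because the discarded columns of $U$ are multiplied by zeros in $\Sigma$.

```python
import numpy as np

# User-item matrix of Table 9.2 (rows: John, Joe, Jill, Jorge).
X = np.array(
    [
        [3, 0, 3],
        [5, 4, 0],
        [1, 2, 4],
        [2, 2, 0],
    ],
    dtype=float,
)


def low_rank(matrix, k):
    """Return the rank-k approximation U_k Sigma_k V_k^T and all singular values."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    return U_k @ np.diag(s_k) @ Vt_k, s


X_2, singular_values = low_rank(X, k=2)

# Frobenius-norm error of the approximation; for a truncated SVD this equals
# the square root of the sum of the squared discarded singular values.
error = np.linalg.norm(X - X_2, ord="fro")
```

For $k = 2$ only the smallest singular value is discarded, so the error equals that singular value, which is a quick way to see what the truncation gives up.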

As mentioned, low-rank approximation helps remove noise from a matrix
by assuming that the matrix is low rank. In low-rank approximation using
SVD, if $X \in \mathbb{R}^{m \times n}$, then $U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, and $V_k^T \in \mathbb{R}^{k \times n}$. Hence,
$U_k$ has the same number of rows as $X$, but in a $k$-dimensional space.
Therefore, $U_k$ represents rows of $X$, but in a transformed $k$-dimensional
space. The same holds for $V_k^T$ because it has the same number of columns
as $X$, but in a $k$-dimensional space. To summarize, $U_k$ and $V_k^T$ can be
thought of as $k$-dimensional representations of rows and columns of $X$. In
this $k$-dimensional space, noise is removed and more similar points should
be closer.

Now, given the user-item matrix $X$, we can remove its noise by computing
$X_k$ from $X$ and getting the new $k$-dimensional user space $U_k$ or the
$k$-dimensional item space $V_k^T$. This way, we can compute the most similar
neighbors based on distances in this $k$-dimensional space. The similarity in
the $k$-dimensional space can be computed using cosine similarity or Pearson
correlation. We demonstrate this via Example 9.3.

Table 9.2. A User-Item Matrix

         Lion King   Aladdin   Mulan
John         3          0        3
Joe          5          4        0
Jill         1          2        4
Jorge        2          2        0


Example 9.3. Consider the user-item matrix in Table 9.2. Assuming this
matrix is $X$, then by computing the SVD of $X = U \Sigma V^T$ we have

$$U = \begin{bmatrix}
-0.4151 & -0.4754 & -0.7679 & 0.1093 \\
-0.7437 & 0.5278 & 0.0169 & -0.4099 \\
-0.4110 & -0.6626 & 0.6207 & -0.0820 \\
-0.3251 & 0.2373 & 0.1572 & 0.9018
\end{bmatrix} \qquad (9.27)$$

$$\Sigma = \begin{bmatrix}
8.0265 & 0 & 0 \\
0 & 4.3886 & 0 \\
0 & 0 & 2.0777 \\
0 & 0 & 0
\end{bmatrix} \qquad (9.28)$$

$$V^T = \begin{bmatrix}
-0.7506 & -0.5540 & -0.3600 \\
0.2335 & 0.2872 & -0.9290 \\
-0.6181 & 0.7814 & 0.0863
\end{bmatrix} \qquad (9.29)$$

Considering a rank 2 approximation (i.e., $k = 2$), we truncate all three
matrices:

$$U_k = \begin{bmatrix}
-0.4151 & -0.4754 \\
-0.7437 & 0.5278 \\
-0.4110 & -0.6626 \\
-0.3251 & 0.2373
\end{bmatrix} \qquad (9.30)$$

$$\Sigma_k = \begin{bmatrix}
8.0265 & 0 \\
0 & 4.3886
\end{bmatrix} \qquad (9.31)$$

$$V_k^T = \begin{bmatrix}
-0.7506 & -0.5540 & -0.3600 \\
0.2335 & 0.2872 & -0.9290
\end{bmatrix} \qquad (9.32)$$

Figure 9.1. Users and Items in the 2-D Space.

The rows of $U_k$ represent users. Similarly, the columns of $V_k^T$ (or rows of
$V_k$) represent items. Thus, we can plot users and items in a 2-D figure. By
plotting user rows or item columns, we avoid computing distances between
them and can visually inspect items or users that are most similar to one
another. Figure 9.1 depicts users and items in a 2-D space. As
shown, to recommend for Jill, John is the most similar individual to her.
Similarly, the most similar item to Lion King is Aladdin.

After most similar items or users are found in the lower $k$-dimensional
space, one can follow the same process outlined in user-based or item-based
collaborative filtering to find the ratings for an unknown item. For instance,
we showed in Example 9.3 (see Figure 9.1) that if we are predicting the
rating $r_{Jill,Lion\ King}$ and assume that neighborhood size is 1, item-based CF
uses $r_{Jill,Aladdin}$, because Aladdin is closest to Lion King. Similarly,
user-based collaborative filtering uses $r_{John,Lion\ King}$, because John is the closest
user to Jill.
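The neighbors read off Figure 9.1 can also be found numerically. The sketch below assumes NumPy and uses Euclidean distance in the 2-D space as a stand-in for visual inspection; the text itself mentions cosine similarity or Pearson correlation, so this choice, like the helper name `nearest`, is an assumption. Distances between rows are unaffected by the sign ambiguity of the singular vectors.

```python
import numpy as np

users = ["John", "Joe", "Jill", "Jorge"]
items = ["Lion King", "Aladdin", "Mulan"]
X = np.array(
    [
        [3, 0, 3],
        [5, 4, 0],
        [1, 2, 4],
        [2, 2, 0],
    ],
    dtype=float,
)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
user_points = U[:, :2]     # one 2-D point per user (rows of U_k)
item_points = Vt[:2, :].T  # one 2-D point per item (columns of V_k^T)


def nearest(points, names, target):
    """Name of the point closest to `target` in Euclidean distance."""
    t = names.index(target)
    dists = np.linalg.norm(points - points[t], axis=1)
    dists[t] = np.inf  # exclude the target itself
    return names[int(np.argmin(dists))]


closest_user = nearest(user_points, users, "Jill")
closest_item = nearest(item_points, items, "Lion King")
```

With a neighborhood size of 1, these two names identify the ratings that user-based and item-based collaborative filtering would draw on when predicting Jill's rating for Lion King.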


9.2.3 Extending Individual Recommendation to 
Groups of Individuals 

All methods discussed thus far are used to predict a rating for item $i$ for
an individual $u$. Advertisements that individuals receive via email marketing
are examples of this type of recommendation on social media. However,
consider ads displayed on the starting page of a social media site.
These ads are shown to a large population of individuals. The goal when
showing these ads is to ensure that they are interesting to the individuals
who observe them. In other words, the site is advertising to a group of
individuals.

Our goal in this section is to formalize how existing methods for recommending
to a single individual can be extended to a group of individuals.
Consider a group of individuals $G = \{u_1, u_2, \ldots, u_n\}$ and a set of products
$I = \{i_1, i_2, \ldots, i_m\}$. From the products in $I$, we aim to recommend
products to our group of individuals $G$ such that the recommendation satisfies
the group being recommended to as much as possible. One approach is to
first consider the ratings predicted for each individual in the group and then
devise methods that can aggregate ratings for the individuals in the group.
Products that have the highest aggregated ratings are selected for recommendation.
Next, we discuss these aggregation strategies for individuals in
the group.


Aggregation Strategies for a Group of Individuals 

We discuss three major aggregation strategies for individuals in the group.
Each aggregation strategy rests on an assumption based on which ratings
are aggregated. Let $r_{u,i}$ denote the rating of user $u \in G$ for item
$i \in I$. Denote $R_i$ as the group-aggregated rating for item $i$.

Maximizing Average Satisfaction. We assume that products that satisfy
each member of the group on average are the best to be recommended to the
group. Then, the group rating $R_i$ based on the maximizing average
satisfaction strategy is given as

$$R_i = \frac{1}{n} \sum_{u \in G} r_{u,i}. \tag{9.33}$$

After we compute $R_i$ for all items $i \in I$, we recommend the items that
have the highest $R_i$ values to members of the group.

Least Misery. This strategy combines ratings by taking the minimum of
them. In other words, we want to guarantee that no individual is being
recommended an item that he or she strongly dislikes. In least misery, the
aggregated rating $R_i$ of an item is given as

$$R_i = \min_{u \in G} r_{u,i}. \tag{9.34}$$

Similar to the previous strategy, we compute $R_i$ for all items $i \in I$
and recommend the items with the highest $R_i$ values. In other words, we
prefer recommending items to the group such that no member of the group
strongly dislikes them.

Most Pleasure. Unlike the least misery strategy, in the most pleasure
approach, we take the maximum rating in the group as the group rating:

$$R_i = \max_{u \in G} r_{u,i}. \tag{9.35}$$

Since we recommend items that have the highest $R_i$ values, this strategy
guarantees that the items that are being recommended to the group are
enjoyed the most by at least one member of the group.

Example 9.4. Consider the user-item matrix in Table 9.3. Consider group
G = {John, Jill, Juan}. For this group, the aggregated ratings for all
products using average satisfaction, least misery, and maximum pleasure are
as follows.


Table 9.3. User-Item Matrix

          Soda   Water   Tea   Coffee
John       1       3      1      1
Joe        4       3      1      2
Jill       2       2      4      2
Jorge      1       1      3      5
Juan       3       3      4      5


Average Satisfaction:

$$R_{\text{Soda}} = \frac{1 + 2 + 3}{3} = 2. \tag{9.36}$$

$$R_{\text{Water}} = \frac{3 + 2 + 3}{3} \approx 2.67. \tag{9.37}$$

$$R_{\text{Tea}} = \frac{1 + 4 + 4}{3} = 3. \tag{9.38}$$

$$R_{\text{Coffee}} = \frac{1 + 2 + 5}{3} \approx 2.67. \tag{9.39}$$


Least Misery:

$$R_{\text{Soda}} = \min\{1, 2, 3\} = 1. \tag{9.40}$$

$$R_{\text{Water}} = \min\{3, 2, 3\} = 2. \tag{9.41}$$

$$R_{\text{Tea}} = \min\{1, 4, 4\} = 1. \tag{9.42}$$

$$R_{\text{Coffee}} = \min\{1, 2, 5\} = 1. \tag{9.43}$$


Maximum Pleasure:

$$R_{\text{Soda}} = \max\{1, 2, 3\} = 3. \tag{9.44}$$

$$R_{\text{Water}} = \max\{3, 2, 3\} = 3. \tag{9.45}$$

$$R_{\text{Tea}} = \max\{1, 4, 4\} = 4. \tag{9.46}$$

$$R_{\text{Coffee}} = \max\{1, 2, 5\} = 5. \tag{9.47}$$


Thus, the first recommended items are tea, water, and coffee based on
average satisfaction, least misery, and maximum pleasure, respectively.
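The three aggregation strategies can be checked with a few lines of Python.
The following is a minimal sketch, not a reference implementation: the
function names `aggregate` and `top_item` and the dictionary layout are
illustrative assumptions, while the ratings are those of Table 9.3.

```python
# Ratings from Table 9.3: user -> {item: rating}.
ratings = {
    "John":  {"Soda": 1, "Water": 3, "Tea": 1, "Coffee": 1},
    "Joe":   {"Soda": 4, "Water": 3, "Tea": 1, "Coffee": 2},
    "Jill":  {"Soda": 2, "Water": 2, "Tea": 4, "Coffee": 2},
    "Jorge": {"Soda": 1, "Water": 1, "Tea": 3, "Coffee": 5},
    "Juan":  {"Soda": 3, "Water": 3, "Tea": 4, "Coffee": 5},
}

# Each strategy maps the list of member ratings for one item to R_i.
STRATEGIES = {
    "average_satisfaction": lambda rs: sum(rs) / len(rs),  # Eq. 9.33
    "least_misery": min,                                   # Eq. 9.34
    "most_pleasure": max,                                  # Eq. 9.35
}


def aggregate(ratings, group, strategy):
    """Return {item: group rating R_i} for the given strategy name."""
    combine = STRATEGIES[strategy]
    items = ratings[group[0]].keys()
    return {i: combine([ratings[u][i] for u in group]) for i in items}


def top_item(ratings, group, strategy):
    """Return the item with the highest aggregated rating."""
    scores = aggregate(ratings, group, strategy)
    return max(scores, key=scores.get)


group = ["John", "Jill", "Juan"]
for name in STRATEGIES:
    print(name, aggregate(ratings, group, name), "->",
          top_item(ratings, group, name))
```

For the group {John, Jill, Juan}, the sketch selects tea, water, and coffee
under the three strategies, in agreement with Example 9.4.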


9.3 Recommendation Using Social Context 


In social media, in addition to ratings of products, there is additional
information available, such as the friendship network among individuals. This
information can be used to improve recommendations, based on the
assumption that an individual’s friends have an impact on the ratings
ascribed to the individual. This impact can be due to homophily, influence,
or confounding, discussed in Chapter 8. When utilizing this social
information (i.e., social context) we can (1) use friendship information
alone, (2) use social information in addition to ratings, or (3) constrain
recommendations using social information. Figure 9.2 compactly represents
these three approaches.


9.3.1 Using Social Context Alone 


Consider a network of friendships for which no user-item rating matrix is 
provided. In this network, we can still recommend users from the network 
to other users for friendship. This is an example of friend recommendation 
in social networks. For instance, in social networking sites, users are often 
provided with a list of individuals they may know and are asked if they wish 
to befriend them. How can we recommend such friends? 

There are many methods that can be used to recommend friends in social
networks. One such method is link prediction, which we discuss in detail
in Chapter 10. We can also use the structure of the network to recommend
friends. For example, it is well known that individuals often form triads of
friendships on social networks. In other words, two friends of an individual
are often friends with one another. A triad of three individuals $a$, $b$,
and $c$ consists of three edges $e(a, b)$, $e(b, c)$, and $e(c, a)$. A triad
that is missing one of these edges is denoted as an open triad. To recommend
friends, we can find open triads and recommend individuals who are not
connected as friends to one another.

Figure 9.2. Recommendation using Social Context. When utilizing social
information, we can 1) utilize this information independently, 2) add it to
the user-rating matrix, or 3) constrain recommendations with it.
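The open-triad idea can be sketched as follows. This is an illustration
under stated assumptions: the friendship network is an undirected graph
stored as a dictionary of neighbor sets, and the function name
`recommend_from_open_triads` and the toy network are hypothetical.

```python
from itertools import combinations


def recommend_from_open_triads(adj):
    """Return pairs (a, b) that share a friend but are not yet friends.

    adj maps each individual to the set of his or her friends
    (undirected: b in adj[a] implies a in adj[b]).
    """
    recommendations = set()
    for center, friends in adj.items():
        # Any two friends of `center` form a triad with it; the triad is
        # open when the edge between the two friends is missing.
        for a, b in combinations(sorted(friends), 2):
            if b not in adj[a]:
                recommendations.add((a, b))
    return recommendations


# Hypothetical path-shaped network: a - b - c - d.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}
print(sorted(recommend_from_open_triads(adj)))
```

Here the pairs (a, c) and (b, d) each close an open triad, so they are the
friendship recommendations produced by this sketch.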


9.3.2 Extending Classical Methods with Social Context 

Social information can also be used in addition to a user-item rating
matrix to improve recommendation. Addition of social information can
be performed by assuming that users that are connected (i.e., friends)
have similar tastes in rating items. We can model the taste of user $i$
using a $k$-dimensional vector $U_i \in \mathbb{R}^{k \times 1}$. We can also
model items in the $k$-dimensional space. Let
$V_j \in \mathbb{R}^{k \times 1}$ denote the item representation in
$k$-dimensional space. We can assume that rating $R_{ij}$ given by user $i$
to item $j$ can be computed as

$$R_{ij} = U_i^T V_j. \tag{9.48}$$

To compute $U_i$ and $V_j$, we can use matrix factorization. We can rewrite
Equation 9.48 in matrix format as

$$R = U^T V, \tag{9.49}$$

where $R \in \mathbb{R}^{n \times m}$, $U \in \mathbb{R}^{k \times n}$,
$V \in \mathbb{R}^{k \times m}$, $n$ is the number of users, and $m$ is
the number of items. Similar to model-based CF discussed in Section 9.2.2,
matrix factorization methods can be used to find $U$ and $V$, given the
user-item rating matrix $R$. In mathematical terms, in this matrix
factorization, we are finding $U$ and $V$ by solving the following
optimization problem:

$$\min_{U,V} \frac{1}{2} \| R - U^T V \|_F^2. \tag{9.50}$$

Users often have only a few ratings for items; therefore, the $R$ matrix is
very sparse and has many missing values. Since we compute $U$ and $V$ only
for nonmissing ratings, we can change Equation 9.50 to

$$\min_{U,V} \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} I_{ij} (R_{ij} - U_i^T V_j)^2, \tag{9.51}$$

where $I_{ij} \in \{0, 1\}$ and $I_{ij} = 1$ when user $i$ has rated item
$j$ and is equal to 0 otherwise. This ensures that nonrated items do not
contribute to the summations being minimized in Equation 9.51. Often, when
solving this optimization problem, the computed $U$ and $V$ can estimate
ratings for the already rated items accurately, but fail at predicting
ratings for unrated items. This is known as the overfitting problem. The
overfitting problem can be mitigated by allowing both $U$ and $V$ to only
consider important features required to represent the data. In mathematical
terms, this is equivalent to both $U$ and $V$ having small matrix norms.
Thus, we can change Equation 9.51 to

$$\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} I_{ij} (R_{ij} - U_i^T V_j)^2 + \frac{\lambda_1}{2} \|U\|_F^2 + \frac{\lambda_2}{2} \|V\|_F^2, \tag{9.52}$$

where $\lambda_1, \lambda_2 > 0$ are predetermined constants that control
the effects of matrix norms. The terms $\frac{\lambda_1}{2} \|U\|_F^2$ and
$\frac{\lambda_2}{2} \|V\|_F^2$ are denoted as regularization terms. Note
that to minimize Equation 9.52, we need to minimize all terms in the
equation, including the regularization terms. Thus, whenever one needs to
minimize some other constraint, it can be introduced as a new additive term
in Equation 9.52. Equation 9.52 lacks a term that incorporates the social
network of users. For that, we can add another regularization term,

$$\sum_{i=1}^{n} \sum_{j \in F(i)} \mathrm{sim}(i, j) \|U_i - U_j\|_F^2, \tag{9.53}$$

where $\mathrm{sim}(i, j)$ denotes the similarity between users $i$ and $j$
(e.g., cosine similarity or Pearson correlation between their ratings) and
$F(i)$ denotes the friends of $i$. When this term is minimized, it ensures
that the taste for user $i$ is close to that of all his friends
$j \in F(i)$. As we did with previous regularization terms, we can add this
term to Equation 9.51. Hence, our final goal is to solve the following
optimization problem:

$$\min_{U,V} \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} I_{ij} (R_{ij} - U_i^T V_j)^2 + \beta \sum_{i=1}^{n} \sum_{j \in F(i)} \mathrm{sim}(i, j) \|U_i - U_j\|_F^2 + \frac{\lambda_1}{2} \|U\|_F^2 + \frac{\lambda_2}{2} \|V\|_F^2, \tag{9.54}$$

where $\beta$ is the constant that controls the effect of social network
regularization. A local minimum for this optimization problem can be
obtained using gradient-descent-based approaches. To solve this problem, we
can compute the gradient with respect to the $U_i$'s and $V_j$'s and perform
a gradient-descent-based method.
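As an illustration of the gradient-descent step, the following pure-Python
sketch minimizes the objective of Equation 9.54 on a small made-up rating
matrix. Several choices here are assumptions rather than part of the text:
the names `social_mf` and `objective`, the hyperparameter values, a constant
similarity of 1 between friends, a symmetric friendship relation (which
makes each friend pair appear twice in the social term and gives the factor
4 in its gradient), and storing $U_i$ and $V_j$ as rows instead of columns.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def objective(R, U, V, friends, sim, beta, lam1, lam2):
    """Value of Equation 9.54; R[i][j] is None when I_ij = 0."""
    fit = 0.5 * sum((R[i][j] - dot(U[i], V[j])) ** 2
                    for i in range(len(R)) for j in range(len(R[0]))
                    if R[i][j] is not None)
    social = beta * sum(
        sim(i, f) * sum((a - b) ** 2 for a, b in zip(U[i], U[f]))
        for i in range(len(R)) for f in friends[i])
    reg = (0.5 * lam1 * sum(x * x for row in U for x in row)
           + 0.5 * lam2 * sum(x * x for row in V for x in row))
    return fit + social + reg


def social_mf(R, friends, sim, k=2, beta=0.1, lam1=0.1, lam2=0.1,
              lr=0.01, steps=500, seed=0):
    """Batch gradient descent on Equation 9.54 (a sketch, not tuned)."""
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    U = [[rng.random() for _ in range(k)] for _ in range(n)]  # row i is U_i
    V = [[rng.random() for _ in range(k)] for _ in range(m)]  # row j is V_j
    history = [objective(R, U, V, friends, sim, beta, lam1, lam2)]
    for _ in range(steps):
        # Gradients of the two norm regularizers.
        gU = [[lam1 * x for x in row] for row in U]
        gV = [[lam2 * x for x in row] for row in V]
        for i in range(n):
            for j in range(m):
                if R[i][j] is None:        # unrated: contributes nothing
                    continue
                err = R[i][j] - dot(U[i], V[j])
                for d in range(k):
                    gU[i][d] -= err * V[j][d]
                    gV[j][d] -= err * U[i][d]
            for f in friends[i]:
                # Symmetric friendships: the pair (i, f) is counted twice.
                for d in range(k):
                    gU[i][d] += 4 * beta * sim(i, f) * (U[i][d] - U[f][d])
        U = [[x - lr * g for x, g in zip(ru, rg)] for ru, rg in zip(U, gU)]
        V = [[x - lr * g for x, g in zip(rv, rg)] for rv, rg in zip(V, gV)]
        history.append(objective(R, U, V, friends, sim, beta, lam1, lam2))
    return U, V, history


# Illustrative data: 5 users, 4 items, one missing rating (None).
R = [[4, 3, 2, 2], [5, 2, 1, 5], [2, 5, None, 0], [1, 3, 4, 3], [3, 1, 1, 2]]
friends = {0: [1, 4], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [0, 2]}
U, V, history = social_mf(R, friends, sim=lambda i, f: 1.0)
print("objective before/after:", history[0], history[-1])
print("estimate for the missing rating:", dot(U[2], V[2]))
```

The missing rating is then estimated as $U_i^T V_j$ for the corresponding
user and item. In practice one would tune $k$, $\beta$, $\lambda_1$,
$\lambda_2$, and the learning rate on held-out ratings.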


9.3.3 Recommendation Constrained by Social Context 

In classical recommendation, to estimate ratings of an item, one determines 
similar users or items. In other words, any user similar to the individual 
can contribute to the predicted ratings for the individual. We can limit the 
set of individuals that can contribute to the ratings of a user to the set of 
friends of the user. For instance, in user-based collaborative filtering, we 
determine a neighborhood of most similar individuals. We can take the 
intersection of this neighborhood with the set of friends of the individual to 
attract recommendations only from friends who are similar enough:

$$r_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u) \cap F(u)} \mathrm{sim}(u, v)(r_{v,i} - \bar{r}_v)}{\sum_{v \in N(u) \cap F(u)} \mathrm{sim}(u, v)}. \tag{9.55}$$

This approach has its own shortcomings. When there is no intersection
between the set of friends and the neighborhood of most similar individuals,
the ratings cannot be computed. To mitigate this, one can use the set of $k$
most similar friends of an individual, $S(u)$, to predict the ratings,

$$r_{u,i} = \bar{r}_u + \frac{\sum_{v \in S(u)} \mathrm{sim}(u, v)(r_{v,i} - \bar{r}_v)}{\sum_{v \in S(u)} \mathrm{sim}(u, v)}. \tag{9.56}$$


Similarly, when friends are not very similar to the individual, the
predicted rating can be different from the rating predicted using most
similar users. Depending on the context, both equations can be utilized.

Table 9.4. User-Item Matrix

          Lion King   Aladdin   Mulan   Anastasia
John          4          3        2         2
Joe           5          2        1         5
Jill          2          5        ?         0
Jane          1          3        4         3
Jorge         3          1        1         2


Example 9.5. Consider the user-item matrix in Table 9.4 and the following
adjacency matrix denoting the friendship among these individuals.

          John   Joe   Jill   Jane   Jorge
John       0      1     0      0      1
Joe        1      0     1      0      0
Jill       0      1     0      1      1
Jane       0      0     1      0      0
Jorge      1      0     1      0      0
                                              (9.57)

We wish to predict $r_{\text{Jill},\text{Mulan}}$. We compute the average
ratings and similarity between Jill and other individuals using cosine
similarity:

$$\bar{r}_{\text{John}} = \frac{4 + 3 + 2 + 2}{4} = 2.75. \tag{9.58}$$

$$\bar{r}_{\text{Joe}} = \frac{5 + 2 + 1 + 5}{4} = 3.25. \tag{9.59}$$

$$\bar{r}_{\text{Jill}} = \frac{2 + 5 + 0}{3} = 2.33. \tag{9.60}$$

$$\bar{r}_{\text{Jane}} = \frac{1 + 3 + 4 + 3}{4} = 2.75. \tag{9.61}$$

$$\bar{r}_{\text{Jorge}} = \frac{3 + 1 + 1 + 2}{4} = 1.75. \tag{9.62}$$


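The similarities are

$$\mathrm{sim}(\text{Jill}, \text{John}) = \frac{2 \times 4 + 5 \times 3 + 0 \times 2}{\sqrt{29}\sqrt{29}} = 0.79. \tag{9.63}$$

$$\mathrm{sim}(\text{Jill}, \text{Joe}) = \frac{2 \times 5 + 5 \times 2 + 0 \times 5}{\sqrt{29}\sqrt{54}} = 0.50. \tag{9.64}$$

$$\mathrm{sim}(\text{Jill}, \text{Jane}) = \frac{2 \times 1 + 5 \times 3 + 0 \times 3}{\sqrt{29}\sqrt{19}} = 0.72. \tag{9.65}$$

$$\mathrm{sim}(\text{Jill}, \text{Jorge}) = \frac{2 \times 3 + 5 \times 1 + 0 \times 2}{\sqrt{29}\sqrt{14}} = 0.54. \tag{9.66}$$

Considering a neighborhood of size 2, the most similar users to Jill are
John and Jane:

$$N(\text{Jill}) = \{\text{John}, \text{Jane}\}. \tag{9.67}$$

We also know that friends of Jill are

$$F(\text{Jill}) = \{\text{Joe}, \text{Jane}, \text{Jorge}\}. \tag{9.68}$$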

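We can use Equation 9.55 to predict the missing rating by taking the
intersection of friends and neighbors:

$$r_{\text{Jill},\text{Mulan}} = \bar{r}_{\text{Jill}} + \frac{\mathrm{sim}(\text{Jill}, \text{Jane})(r_{\text{Jane},\text{Mulan}} - \bar{r}_{\text{Jane}})}{\mathrm{sim}(\text{Jill}, \text{Jane})} = 2.33 + (4 - 2.75) = 3.58. \tag{9.69}$$

Similarly, we can utilize Equation 9.56 to compute the missing rating.
Here, we take Jill’s two most similar friends: Jane and Jorge.

$$r_{\text{Jill},\text{Mulan}} = \bar{r}_{\text{Jill}} + \frac{\mathrm{sim}(\text{Jill}, \text{Jane})(r_{\text{Jane},\text{Mulan}} - \bar{r}_{\text{Jane}})}{\mathrm{sim}(\text{Jill}, \text{Jane}) + \mathrm{sim}(\text{Jill}, \text{Jorge})} + \frac{\mathrm{sim}(\text{Jill}, \text{Jorge})(r_{\text{Jorge},\text{Mulan}} - \bar{r}_{\text{Jorge}})}{\mathrm{sim}(\text{Jill}, \text{Jane}) + \mathrm{sim}(\text{Jill}, \text{Jorge})}$$

$$= 2.33 + \frac{0.72(4 - 2.75) + 0.54(1 - 1.75)}{0.72 + 0.54} = 2.72. \tag{9.70}$$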
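The two predictions of Example 9.5 can be reproduced with a short script.
This is a sketch under assumptions: the helper names (`mean`, `cosine`,
`predict`, `most_similar`) are illustrative, and, as in the example, cosine
similarity is computed over the items both users have rated, with Jill's 0
for Anastasia treated as a rating.

```python
from math import sqrt

# Table 9.4; Jill's rating for Mulan is the one to predict.
ratings = {
    "John":  {"Lion King": 4, "Aladdin": 3, "Mulan": 2, "Anastasia": 2},
    "Joe":   {"Lion King": 5, "Aladdin": 2, "Mulan": 1, "Anastasia": 5},
    "Jill":  {"Lion King": 2, "Aladdin": 5, "Anastasia": 0},
    "Jane":  {"Lion King": 1, "Aladdin": 3, "Mulan": 4, "Anastasia": 3},
    "Jorge": {"Lion King": 3, "Aladdin": 1, "Mulan": 1, "Anastasia": 2},
}
friends_of_jill = {"Joe", "Jane", "Jorge"}   # row "Jill" of Equation 9.57


def mean(u):
    return sum(ratings[u].values()) / len(ratings[u])


def cosine(u, v):
    common = ratings[u].keys() & ratings[v].keys()
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = (sqrt(sum(ratings[u][i] ** 2 for i in common))
           * sqrt(sum(ratings[v][i] ** 2 for i in common)))
    return num / den


def most_similar(u, candidates, size):
    return set(sorted(candidates, key=lambda v: cosine(u, v),
                      reverse=True)[:size])


def predict(u, item, contributors):
    """Mean-centered weighted average over `contributors` (Eqs. 9.55, 9.56)."""
    num = sum(cosine(u, v) * (ratings[v][item] - mean(v)) for v in contributors)
    den = sum(cosine(u, v) for v in contributors)
    return mean(u) + num / den


others = set(ratings) - {"Jill"}
neighbors = most_similar("Jill", others, 2)             # N(Jill)
p_intersection = predict("Jill", "Mulan", neighbors & friends_of_jill)
similar_friends = most_similar("Jill", friends_of_jill, 2)   # S(Jill)
p_friends = predict("Jill", "Mulan", similar_friends)
print(round(p_intersection, 2), round(p_friends, 2))
```

Note that Equation 9.55 is undefined when the intersection is empty; a
production version would need a fallback, such as Equation 9.56.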


9.4 Evaluating Recommendations 

When a recommendation algorithm predicts ratings for items, one must 
evaluate how accurate its recommendations are. One can evaluate the (1) 
accuracy of predictions, (2) relevancy of recommendations, or (3) rankings 
of recommendations. 


9.4.1 Evaluating Accuracy of Predictions 

When evaluating the accuracy of predictions, we measure how close
predicted ratings are to the true ratings. Similar to the evaluation of
supervised learning, we often predict the ratings of some items with known
ratings (i.e., true ratings) and compute how close the predictions are to
the true ratings. One of the simplest methods, mean absolute error (MAE),
computes the average absolute difference between the predicted ratings and
true ratings,

$$\text{MAE} = \frac{\sum_{ij} |\hat{r}_{ij} - r_{ij}|}{n}, \tag{9.71}$$

where $n$ is the number of predicted ratings, $\hat{r}_{ij}$ is the
predicted rating, and $r_{ij}$ is the true rating. Normalized mean absolute
error (NMAE) normalizes MAE by dividing it by the range ratings can take,

$$\text{NMAE} = \frac{\text{MAE}}{r_{\max} - r_{\min}}, \tag{9.72}$$

where $r_{\max}$ is the maximum rating items can take and $r_{\min}$ is the
minimum. In MAE, error linearly contributes to the MAE value. We can
increase this contribution by considering the summation of squared errors
in the root mean squared error (RMSE):

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i,j} (\hat{r}_{ij} - r_{ij})^2}. \tag{9.73}$$


Example 9.6. Consider the following table with both the predicted ratings
and true ratings of five items:

Item   Predicted Rating   True Rating
 1            1                3
 2            2                5
 3            3                3
 4            4                2
 5            4                1

The MAE, NMAE, and RMSE values are

$$\text{MAE} = \frac{|1 - 3| + |2 - 5| + |3 - 3| + |4 - 2| + |4 - 1|}{5} = 2. \tag{9.74}$$

$$\text{NMAE} = \frac{\text{MAE}}{5 - 1} = 0.5. \tag{9.75}$$

$$\text{RMSE} = \sqrt{\frac{(1 - 3)^2 + (2 - 5)^2 + (3 - 3)^2 + (4 - 2)^2 + (4 - 1)^2}{5}} = 2.28. \tag{9.76}$$
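The three accuracy measures are simple to compute. The sketch below uses the
data of Example 9.6; the function names are illustrative, and the rating
range of 1 to 5 is the one assumed in the example.

```python
from math import sqrt


def mae(predicted, true):
    """Mean absolute error (Equation 9.71)."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(predicted)


def nmae(predicted, true, r_min, r_max):
    """MAE normalized by the rating range (Equation 9.72)."""
    return mae(predicted, true) / (r_max - r_min)


def rmse(predicted, true):
    """Root mean squared error (Equation 9.73)."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true))
                / len(predicted))


predicted = [1, 2, 3, 4, 4]
true = [3, 5, 3, 2, 1]
print(mae(predicted, true))                  # 2.0
print(nmae(predicted, true, 1, 5))           # 0.5
print(round(rmse(predicted, true), 2))       # 2.28
```

Because RMSE squares each error before averaging, the two errors of size 3
dominate the result more than they do in MAE.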


9.4.2 Evaluating Relevancy of Recommendations 

When evaluating recommendations based on relevancy, we ask users if
they find the recommended items relevant to their interests. Given a set
of recommendations to a user, the user describes each recommendation as
relevant or irrelevant. Based on the selection of items for recommendations
and their relevancy, we can have the four types of items outlined in Table 9.5.

Table 9.5. Partitioning of Items with Respect to Their
Selection for Recommendation and Their Relevancy

             Selected   Not Selected   Total
Relevant     $N_{rs}$     $N_{rn}$     $N_r$
Irrelevant   $N_{is}$     $N_{in}$     $N_i$
Total        $N_s$        $N_n$        $N$

Given this table, we can define measures that use relevancy information
provided by users. Precision is one such measure. It defines the fraction of
relevant items among recommended items:

$$P = \frac{N_{rs}}{N_s}. \tag{9.77}$$

Similarly, we can use recall to evaluate a recommender algorithm, which
provides the probability of selecting a relevant item for recommendation:

$$R = \frac{N_{rs}}{N_r}. \tag{9.78}$$

We can also combine both precision and recall by taking their harmonic
mean in the F-measure:

$$F = \frac{2PR}{P + R}. \tag{9.79}$$


Example 9.7. Consider the following recommendation relevancy matrix
for a set of 40 items.

             Selected   Not Selected   Total
Relevant         9           15          24
Irrelevant       3           13          16
Total           12           28          40

For this table, the precision, recall, and F-measure values are

$$P = \frac{9}{12} = 0.75. \tag{9.80}$$

$$R = \frac{9}{24} = 0.375. \tag{9.81}$$

$$F = \frac{2 \times 0.75 \times 0.375}{0.75 + 0.375} = 0.5. \tag{9.82}$$
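A small sketch of these relevancy measures, using the counts of Example 9.7.
The function and parameter names are illustrative; the counts follow the
notation of Table 9.5 ($N_{rs}$, $N_s$, $N_r$).

```python
def precision(n_rs, n_s):
    """Fraction of selected items that are relevant (Equation 9.77)."""
    return n_rs / n_s


def recall(n_rs, n_r):
    """Fraction of relevant items that were selected (Equation 9.78)."""
    return n_rs / n_r


def f_measure(p, r):
    """Harmonic mean of precision and recall (Equation 9.79)."""
    return 2 * p * r / (p + r)


# Counts from Example 9.7.
n_rs, n_s, n_r = 9, 12, 24
p = precision(n_rs, n_s)
r = recall(n_rs, n_r)
print(p, r, f_measure(p, r))
```

The sketch does not guard against division by zero, which occurs when no
item is selected, when no item is relevant, or when both precision and
recall are 0.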
9.4.3 Evaluating Ranking of Recommendations

Often, we predict ratings for multiple products for a user. Based on the
predicted ratings, we can rank products based on their levels of
interestingness to the user and then evaluate this ranking. Given the true
ranking of interestingness of items, we can compare the predicted ranking
with it and report a value. Rank correlation measures the correlation
between the predicted ranking and the true ranking. One such technique is
the Spearman’s rank correlation discussed in Chapter 8. Let $x_i$,
$1 \le x_i \le n$, denote the rank predicted for item $i$, $1 \le i \le n$.
Similarly, let $y_i$, $1 \le y_i \le n$, denote the true rank of item $i$
from the user’s perspective. Spearman’s rank correlation is defined as

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n^3 - n}, \tag{9.83}$$


where n is the total number of items. 
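Equation 9.83 translates directly into code. The sketch below assumes that
both rankings are permutations of 1 to $n$ without ties; the function name
and the sample ranks are illustrative.

```python
def spearman_rho(x, y):
    """Spearman's rank correlation for two tie-free rankings (Eq. 9.83)."""
    n = len(x)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * squared_diffs / (n ** 3 - n)


predicted_rank = [1, 2, 3, 4]   # illustrative ranks for four items
true_rank = [1, 4, 2, 3]
print(spearman_rho(predicted_rank, true_rank))
```

For these ranks the squared differences sum to 6, so
$\rho = 1 - 36/60 = 0.4$. Identical rankings give $\rho = 1$, and fully
reversed rankings give $\rho = -1$.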

Here, we discuss another rank correlation measure: Kendall’s tau. We
say that the pair of items $(i, j)$ are concordant if their ranks
$(x_i, y_i)$ and $(x_j, y_j)$ are in order:

$$x_i > x_j,\ y_i > y_j \quad \text{or} \quad x_i < x_j,\ y_i < y_j. \tag{9.84}$$

A pair of items is discordant if their corresponding ranks are not in
order:

$$x_i > x_j,\ y_i < y_j \quad \text{or} \quad x_i < x_j,\ y_i > y_j. \tag{9.85}$$

When $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor
discordant. Let $c$ denote the total number of concordant item pairs and $d$
the total number of discordant item pairs. Kendall’s tau computes the
difference between the two, normalized by the total number of item pairs
$\binom{n}{2}$:

$$\tau = \frac{c - d}{\binom{n}{2}}. \tag{9.86}$$


Kendall’s tau takes value in range $[-1, 1]$. When the ranks completely
agree, all pairs are concordant and Kendall’s tau takes value 1, and when the
ranks completely disagree, all pairs are discordant and Kendall’s tau takes
value $-1$.

Example 9.8. Consider a set of four items $I = \{i_1, i_2, i_3, i_4\}$ for
which the predicted and true rankings are as follows:

         Predicted Rank   True Rank
$i_1$          1              1
$i_2$          2              4
$i_3$          3              2
$i_4$          4              3

The pairs of items and their status (concordant/discordant) are

$(i_1, i_2)$ : concordant    (9.87)
$(i_1, i_3)$ : concordant    (9.88)
$(i_1, i_4)$ : concordant    (9.89)
$(i_2, i_3)$ : discordant    (9.90)
$(i_2, i_4)$ : discordant    (9.91)
$(i_3, i_4)$ : concordant    (9.92)

Thus, Kendall’s tau for the rankings is

$$\tau = \frac{4 - 2}{6} = 0.33. \tag{9.93}$$
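The pair counting in Example 9.8 can be automated. The following sketch
implements Equation 9.86 as stated, so tied pairs count as neither
concordant nor discordant but still appear in the denominator; the function
name `kendall_tau` is illustrative.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau as in Equation 9.86: (c - d) / C(n, 2)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: both rankings order the pair the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0: tie in x or y, neither concordant nor discordant
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


predicted_rank = [1, 2, 3, 4]   # ranks of i1..i4 from Example 9.8
true_rank = [1, 4, 2, 3]
print(round(kendall_tau(predicted_rank, true_rank), 2))   # 0.33
```

This quadratic-time loop is adequate for short recommendation lists; for
long rankings, a faster algorithm would be preferable.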

6 

9.5 Summary 

In social media, recommendations are constantly being provided. Friend 
recommendation, product recommendation, and video recommendation, 
among others, are all examples of recommendations taking place in social 
media. Unlike web search, recommendation is tailored to individuals’
interests and can help recommend more relevant items. Recommendation is
challenging due to the cold-start problem, data sparsity, attacks on these
systems, privacy concerns, and the need for an explanation for why items
are being recommended.

In social media, sites often resort to classical recommendation algorithms
to recommend items or products. These techniques can be divided into
content-based methods and collaborative filtering techniques. In
content-based methods, we use the similarity between the content (e.g., item
description) of items and user profiles to recommend items. In collaborative
filtering (CF), we use historical ratings of individuals in the form of a
user-item matrix to recommend items. CF methods can be categorized into
memory-based and model-based techniques. In memory-based techniques, we use
the similarity between users (user-based) or items (item-based) to predict
missing ratings. In model-based techniques, we assume that an underlying
model describes how users rate items. Using matrix factorization techniques
we approximate this model to predict missing ratings. Classical
recommendation algorithms often predict ratings for individuals. We
discussed ways to extend these techniques to groups of individuals.

In social media, we can also use friendship information to give
recommendations. These friendships alone can help recommend (e.g., friend
recommendation), can be added as complementary information to classical 
techniques, or can be used to constrain the recommendations provided by 
classical techniques. 

Finally, we discussed the evaluation of recommendation techniques. 
Evaluation can be performed in terms of accuracy, relevancy, and rank of 
recommended items. We discussed MAE, NMAE, and RMSE as methods 
that evaluate accuracy, precision, recall, and F-measure from relevancy- 
based methods, and Kendall’s tau from rank-based methods. 

9.6 Bibliographic Notes 

General references for the content provided in this chapter can be found
in [Jannach et al., 2010; Resnick and Varian, 1997; Schafer et al., 1999;
Adomavicius and Tuzhilin, 2005]. In social media, recommendation is
utilized for various items, including blogs [Arguello et al., 2008], news
[Liu et al., 2010; Das et al., 2007], videos [Davidson et al., 2010], and
tags [Sigurbjornsson and Van Zwol, 2008]. For example, the YouTube video
recommendation system employs co-visitation counts to compute the similarity
between videos (items). To perform recommendations, videos with high
similarity to a seed set of videos are recommended to the user. The seed
set consists of the videos that users watched on YouTube (beyond a certain
threshold), as well as videos that are explicitly favorited, “liked,” rated, or
added to playlists.

Among classical techniques, more on content-based recommendation
can be found in [Palla et al., 2007], and more on collaborative filtering can
be found in [Su and Khoshgoftaar, 2009; Sarwar et al., 2001; Schafer et al.,
2007]. Content-based and CF methods can be combined into hybrid
methods, which are not discussed in this chapter. A survey of hybrid methods
is available in [Burke, 2002]. More details on extending classical techniques
to groups are provided in Jameson and Smyth [2007].

When making recommendations using social context, we can use
additional information such as tags [Guy et al., 2010; Sen et al., 2009] or
trust [Golbeck and Hendler, 2006; O’Donovan and Smyth, 2005; Massa
and Avesani, 2004; Ma et al., 2009]. For instance, in [Tang, Gao, and Liu,
2012b], the authors discern multiple facets of trust and apply multifaceted
trust in social recommendation. In another work, Tang et al. [2012a] exploit
the evolution of both rating and trust relations for social recommendation.
Users in the physical world are likely to ask for suggestions from their local
friends while they also tend to seek suggestions from users with high global
reputations (e.g., reviews by vine voice reviewers of Amazon.com).
Therefore, in addition to friends, one can also use global network
information for better recommendations. In [Tang et al., 2013b], the authors
exploit both local and global social relations for recommendation.

When recommending people (potential friends), we can use all these 
types of information. A comparison of different people recommendation 
techniques can be found in the work of Chen et al. [2009]. Methods that 
extend classical techniques with social context are discussed in [Ma et al., 
2008, 2011; Konstas et al., 2009]. 

9.7 Exercises 

Classical Recommendation Algorithm 

1. Discuss one difference between content-based recommendation and 
collaborative filtering. 

2. Compute the missing rating in this table using user-based collaborative
filtering (CF). Use cosine similarity to find the nearest neighbors.

            God   Le Cercle   Cidade    Rashomon   La vita   $\bar{r}_u$
                  Rouge       de Deus              e bella
Newton       3       0           3         3          2
Einstein     5       4           0         2          3
Gauss        1       2           4         2          0
Aristotle    3       ?           1         0          2         1.5
Euclid       2       2           0         1          5

Assuming that you have computed similarity values in the following
table, calculate Aristotle’s rating by completing these four tasks:

            Newton   Einstein   Gauss   Euclid
Aristotle    0.76       ?       0.40     0.78

• Calculate the similarity value between Aristotle and Einstein.

• Identify Aristotle’s two nearest neighbors.

• Calculate $\bar{r}_u$ values for everyone (Aristotle’s is given).

• Calculate Aristotle’s rating for Le Cercle Rouge.

3. In an item-based recommendation, describe how the recommender 
finds and recommends items to the given user. 


Recommendation Using Social Context 

4. Provide two examples where social context can help improve classical 
recommendation algorithms in social media. 

5. In Equation 9.54, the term
$\beta \sum_{i=1}^{n} \sum_{j \in F(i)} \mathrm{sim}(i, j) \|U_i - U_j\|_F^2$
is added to model the similarity between friends’ tastes. Let
$T \in \mathbb{R}^{n \times n}$ denote the pairwise trust matrix, in which
$0 \le T_{ij} \le 1$ denotes how much user $i$ trusts user $j$. Using your
intuition on how trustworthiness of individuals should affect
recommendations received from them, modify Equation 9.54 using trust
matrix $T$.


Evaluating Recommendation Algorithms 

6. What does “high precision” mean? Why is precision alone insufficient
to measure performance under normal circumstances? Provide an
example to show that both precision and recall are important.

7. When is Kendall’s tau equal to $-1$? In other words, how is the predicted
ranking different from the true ranking?


10 

Behavior Analytics 


What motivates individuals to join an online group? When individuals 
abandon social media sites, where do they migrate to? Can we predict 
box office revenues for movies from tweets posted by individuals? These 
questions are a few of many whose answers require us to analyze or predict 
behaviors on social media. 

Individuals exhibit different behaviors in social media: as individuals or
as part of a broader collective behavior. When discussing individual
behavior, our focus is on one individual. Collective behavior emerges when a
population of individuals behave in a similar way with or without
coordination or planning.

In this chapter we provide examples of individual and collective
behaviors and elaborate on techniques used to analyze, model, and predict
these behaviors.


10.1 Individual Behavior 

We read online news; comment on posts, blogs, and videos; write reviews 
for products; post; like; share; tweet; rate; recommend; listen to music; and 
watch videos, among many other daily behaviors that we exhibit on social 
media. What are the types of individual behavior that leave a trace on social 
media? 

We can generally categorize individual online behavior into three
categories (shown in Figure 10.1):

1. User-User Behavior. This is the behavior individuals exhibit with
respect to other individuals. For instance, when befriending someone,
sending a message to another individual, playing games, following,
inviting, blocking, subscribing, or chatting, we are demonstrating a
user-user behavior.

2. User-Community Behavior. The target of this type of behavior is a
community. For example, joining or leaving a community, becoming
a fan of a community, or participating in its discussions are forms of
user-community behavior.

3. User-Entity Behavior. The target of this behavior is an entity in
social media. For instance, it includes writing a blogpost or review
or uploading a photo to a social media site.

Figure 10.1. Individual Behavior.

As we know, link data and content data are frequently available on social 
media. Link data represents the interactions users have with other users, 
and content data is generated by users when using social media. One can 
think of user-user behavior as users linking to other users and user-entity 
behavior as users generating and consuming content. Users interacting 
with communities is a blend of linking and content-generation behavior, 
in which one can simply join a community (linking), read or write content 
for a community (content consumption and generation), or can do a mix 
of both activities. Link analysis and link prediction are commonly used to 
analyze links, and text analysis is designed to analyze content. We use these 
techniques to analyze, model, and predict individual behavior. 






10.1.1 Individual Behavior Analysis 

Individual behavior analysis aims to understand how different factors affect 
individual behaviors observed online. It aims to correlate those behaviors 
(or their intensity) with other measurable characteristics of users, sites, or 
contents that could have possibly resulted in those behaviors. 

First we discuss an example of behavior analysis on social media and 
demonstrate how this behavior can be analyzed. After that, we outline the 
process that can be followed to analyze any behavior on social media. 


Community Membership in Social Media 

Users often join different communities in social media; the act of becoming
a community member is an example of user-community behavior. Why do
users join communities? In other words, what factors affect the
community-joining behavior of individuals?

To analyze community-joining behavior, we can observe users who join
communities and determine the factors that are common among them.
Hence, we require a population of users $U = \{u_1, u_2, \dots, u_n\}$, a
community $C$, and community membership information (i.e., users
$u_i \in U$ who are members of $C$). The community need not be explicitly
defined. For instance, one can think of individuals buying a product as a
community, and people buying the product for the first time as individuals
joining the community. To distinguish between users who have already joined
the community and those who are now joining it, we need community
memberships at two different times: $t_1$ and $t_2$, with $t_2 > t_1$. At
$t_2$, we determine users such as $u$ who are currently members of the
community, but were not members at $t_1$. These new users form the
subpopulation that is analyzed for community-joining behavior.
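In code, identifying this subpopulation amounts to a set difference between
the two membership snapshots. The sketch below is illustrative only: the
function name `new_members` and the user identifiers are made up.

```python
def new_members(members_t1, members_t2):
    """Users who are members at time t2 but were not members at time t1."""
    return set(members_t2) - set(members_t1)


# Hypothetical membership snapshots of one community C.
members_t1 = {"u1", "u2"}
members_t2 = {"u1", "u2", "u3", "u5"}
print(sorted(new_members(members_t1, members_t2)))   # ['u3', 'u5']
```

Users who left the community between $t_1$ and $t_2$ are ignored here, since
only joining behavior is being analyzed.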

To determine factors that affect community-joining behavior, we can
design hypotheses based on different factors that describe when
community-joining behavior takes place. We can verify these hypotheses by
using data available on social media. The factors used in the validated
hypotheses describe the behavior under study most accurately.

One such hypothesis is that individuals are inclined toward an activity
when their friends are engaged in the same activity. Thus, if the hypothesis
is valid, a factor that plays a role in users joining a community is the
number of their friends who are already members of the community. In data
mining terms, this translates to using the number of friends of an individual
in a community as a feature to predict whether the individual joins the
community (i.e., class attribute). Figure 10.2 depicts the probability of
joining a community with respect to the number of friends an individual
has who are already members of the community. The probability increases
as more friends are in a community, but a diminishing returns property is
also observed, meaning that when enough friends are inside the community,
more friends have no or only marginal effects on the likelihood of the
individual’s act of joining the community.

Figure 10.2. Probability of Joining a Community (with Error Bars) as a
Function of the Number of Friends m Already in the Community (from
Backstrom et al. [2006]).

Thus far we have defined only one feature. However, one can go beyond 
a single feature. Figure 10.3 lists the comprehensive features that can be 
used to analyze community-joining behavior. 

As discussed, these features may or may not affect the joining behavior; 
thus, a validation procedure is required to understand their effect on the 
joining behavior. Which one of these features is more relevant to the joining 
behavior? In other words, which feature can help best determine whether 
individuals will join or not? 

To answer this question, we can use any feature selection algorithm.
Feature selection algorithms determine features that contribute the most to
the prediction of the class attribute. Alternatively, we can use a classification
algorithm, such as decision tree learning, to identify the relationship
between features and the class attribute (i.e., joined={Yes, No}). The earlier
a feature is selected in the learned tree (i.e., the closer it is to the root of the
tree), the more important it is to the prediction of the class attribute value.
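
The idea that features chosen near the root are the most informative can be sketched with information gain, one common split criterion in decision tree learning. The records, the feature names many_friends and friends_connected, and the helper functions below are hypothetical and only illustrate the ranking step; they are not data or code from Backstrom et al.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target="joined"):
    """Entropy reduction obtained by splitting rows on one categorical feature."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical, hand-made records (an assumption for illustration only).
rows = [
    {"many_friends": "yes", "friends_connected": "yes", "joined": "Yes"},
    {"many_friends": "yes", "friends_connected": "yes", "joined": "Yes"},
    {"many_friends": "yes", "friends_connected": "no",  "joined": "No"},
    {"many_friends": "no",  "friends_connected": "yes", "joined": "Yes"},
    {"many_friends": "no",  "friends_connected": "no",  "joined": "No"},
    {"many_friends": "no",  "friends_connected": "no",  "joined": "No"},
]

features = ["many_friends", "friends_connected"]
# The feature with the largest gain is the one a tree learner would test first.
ranking = sorted(features, key=lambda f: information_gain(rows, f), reverse=True)
print(ranking)
```

In this toy dataset the feature friends_connected separates the two classes perfectly, so it would be placed at the root.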






























Feature Set: Features related to the community, C. (Edges between only
members of the community are $E_C \subseteq E$.)

• Number of members ($|C|$).
• Number of individuals with a friend in C (the fringe of C).
• Number of edges with one end in the community and the other in the fringe.
• Number of edges with both ends in the community, $|E_C|$.
• The number of open triads: $|\{(u, v, w) \mid (u, v) \in E_C \wedge (v, w) \in E_C \wedge (u, w) \notin E_C \wedge u \neq w\}|$.
• The number of closed triads: $|\{(u, v, w) \mid (u, v) \in E_C \wedge (v, w) \in E_C \wedge (u, w) \in E_C\}|$.
• The ratio of closed to open triads.
• The fraction of individuals in the fringe with at least k friends in the community for $2 \leq k \leq 19$.
• The number of posts and responses made by members of the community.
• The number of members of the community with at least one post or response.
• The number of responses per post.

Feature Set: Features related to an individual u and her set S of friends
in community C.

• Number of friends in community ($|S|$).
• Number of adjacent pairs in S ($|\{(u, v) \mid u, v \in S \wedge (u, v) \in E_C\}|$).
• Number of pairs in S connected via a path in $E_C$.
• Average distance between friends connected via a path in $E_C$.
• Number of community members reachable from S using edges in $E_C$.
• Average distance from S to reachable community members using edges in $E_C$.
• The number of posts and responses made by individuals in S.
• The number of individuals in S with at least 1 post or response.


Figure 10.3. User Community-Joining Behavior Features (from Backstrom et al.
[2006]).


By performing decision tree learning for a large dataset of users and the 
features listed in Figure 10.3, one finds that not only the number of friends 
inside a community but also how these friends are connected to each other 
affect the joining probability. In particular, the denser the subgraph of 
friends inside a community, the higher the likelihood of a user joining 
the community. Let S denote the set of friends inside community C, and
let $E_S$ denote the set of edges between these $|S|$ friends. The maximum
number of edges between these $|S|$ friends is $\binom{|S|}{2}$. So, the edge density is
$\phi(S) = |E_S| / \binom{|S|}{2}$. One finds that the higher this density, the more likely that
one is going to join a community. Figure 10.4 shows the first two levels of the
decision tree learned for this task using features described in Figure 10.3. 
Higher level features are more discriminative in decision tree learning, and 
in our case, the most important feature is the density of edges for the friends 
subgraph inside the community. 
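
As a small illustration of the edge density $\phi(S)$, the following sketch counts the friendships among a user's friends and divides by the number of possible pairs. The function name edge_density, the toy edge list, and the choice to return 0 when fewer than two friends are present are assumptions made for this example.

```python
from itertools import combinations

def edge_density(friends, edges):
    """phi(S) = |E_S| / C(|S|, 2) for an undirected friendship graph."""
    friends = set(friends)
    possible = len(friends) * (len(friends) - 1) // 2
    if possible == 0:
        return 0.0  # convention chosen here for |S| < 2
    undirected = {frozenset(e) for e in edges}
    present = sum(1 for pair in combinations(friends, 2)
                  if frozenset(pair) in undirected)
    return present / possible

# Hypothetical friendships; "x" is not one of the friends inside the community.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "x")]
print(edge_density({"a", "b", "c"}, edges))  # 2 of the 3 possible pairs are connected
```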

To analyze community-joining behavior, one can design features that are
likely to be related to community-joining behavior. Decision tree learning
can help identify which features are more predictive than others. However,
how can we evaluate whether these features are designed well and whether no
additional features are required to accurately predict joining behavior? Since
classification is used to learn the relation between features and behaviors, one
can always use classification evaluation metrics such as accuracy to evaluate
the performance of the learned model. An accurate model translates to an
accurate learning of feature-behavior association.


Figure 10.4. Decision Tree Learned for Community-Joining Behavior (from Backstrom
et al. [2006]). Node labels: “Are Friends with Each Other” and “Number of Connected Pairs”;
values shown at the leaves: $4.88 \times 10^{-3}$, $3.70 \times 10^{-4}$, $7.222 \times 10^{-4}$, and $1.82 \times 10^{-3}$.


A Behavior Analysis Methodology 


The analysis of community-joining behavior can be summarized via a four-step
methodology for behavioral analysis. The same approach can be followed
as a general guideline for analyzing other behaviors in social media.

Commonly, to perform behavioral analysis, one needs the following four 
components: 

1. An observable behavior. The behavior that is analyzed needs to be 
observable. For instance, to analyze community-joining behavior, it 
is necessary to be able to accurately observe the joining of individuals 
(and possibly their joining times). 

2. Features. One needs to construct relevant data features (covariates)
that may or may not affect (or be affected by) the behavior. Anthropologists
and sociologists can help design these features. The intrinsic
relation between these features and the behavior should be clear from
the domain expert’s point of view. In community-joining behavior,
we used the number of friends inside the community as one feature.

3. Feature-Behavior Association. This step aims to find the relationship
between features and behavior, which describes how changes
in features result in the behavior (or change its intensity). We used
decision tree learning to find features that are most correlated with
community-joining behavior.

4. Evaluation Strategy. The final step evaluates the findings. This evaluation
helps ensure that the findings are due to the features defined and
not to externalities. We use classification accuracy to verify the quality
of features in community-joining behavior. Various evaluation
techniques can be used, such as randomization tests discussed in
Chapter 8. In randomization tests, we measure a phenomenon in a
dataset and then randomly generate subsamples from the dataset in
which the phenomenon is guaranteed to be removed. We assume
the phenomenon has happened when the measurements on the subsamples
are different from the ones on the original dataset. Another
approach is to use causality testing methods. Causality testing methods
measure how a feature can affect a phenomenon. A well-known
causality detection technique is called Granger causality, due to Clive
W. J. Granger, the Nobel laureate in economics.

Definition 10.1. Granger Causality. Assume we are given
two temporal variables $X = \{X_1, X_2, \ldots, X_t, X_{t+1}, \ldots\}$ and
$Y = \{Y_1, Y_2, \ldots, Y_t, Y_{t+1}, \ldots\}$. Variable X “Granger causes” variable Y
when historical values of X can help better predict Y than just using
the historical values of Y.

Consider a linear regression model outlined in Chapter 5. We can predict
$Y_{t+1}$ by using either $Y_1, \ldots, Y_t$ or a combination of $X_1, \ldots, X_t$
and $Y_1, \ldots, Y_t$:

$Y_{t+1} = \sum_{i=1}^{t} a_i Y_i + \epsilon_1$,   (10.1)

$Y_{t+1} = \sum_{i=1}^{t} a_i Y_i + \sum_{i=1}^{t} b_i X_i + \epsilon_2$,   (10.2)

where $\epsilon_1$ and $\epsilon_2$ are the regression model errors. Now, if $\epsilon_2 < \epsilon_1$, it
indicates that using X helps reduce the error. In this case, X Granger
causes Y.
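
The comparison between Equations 10.1 and 10.2 can be sketched as follows. This is a simplified illustration, assuming only one lagged value of each series and no intercept; the function names and the toy series are hypothetical. Note that the in-sample error of the larger model can never exceed that of the smaller one, so in practice the reduction is judged with a statistical test (for example, an F-test) or on held-out data.

```python
def sse_y_only(y):
    """Squared error of the lag-1 model Y[t+1] = a * Y[t], fitted by least squares."""
    prev, nxt = y[:-1], y[1:]
    a = sum(p * n for p, n in zip(prev, nxt)) / sum(p * p for p in prev)
    return sum((n - a * p) ** 2 for p, n in zip(prev, nxt))

def sse_with_x(x, y):
    """Squared error of the lag-1 model Y[t+1] = a * Y[t] + b * X[t]."""
    yp, xp, nxt = y[:-1], x[:-1], y[1:]
    # Normal equations of the two-regressor least squares problem.
    syy = sum(v * v for v in yp)
    sxx = sum(v * v for v in xp)
    sxy = sum(u * v for u, v in zip(xp, yp))
    syn = sum(u * v for u, v in zip(yp, nxt))
    sxn = sum(u * v for u, v in zip(xp, nxt))
    det = syy * sxx - sxy * sxy
    a = (syn * sxx - sxn * sxy) / det
    b = (sxn * syy - syn * sxy) / det
    return sum((n - a * p - b * q) ** 2 for p, q, n in zip(yp, xp, nxt))

# Toy series in which Y simply repeats X with a delay of one step.
x = [1, 3, 2, 5, 4, 6, 2, 7]
y = [2, 1, 3, 2, 5, 4, 6, 2]

e1 = sse_y_only(y)      # error when only the history of Y is used
e2 = sse_with_x(x, y)   # error when the history of X is added
print(e1, e2, e2 < e1)
```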


10.1.2 Individual Behavior Modeling 

Similar to network models, models of individual behavior can help concretely
describe why specific individual behaviors are observed in social
media. In addition, they allow for controlled experiments and simulations
that can help study individuals in social media.

As with other modeling approaches (see Chapter 4), in behavior modeling,
one must make a set of assumptions. Behavior modeling can be
performed via a variety of techniques, including those from economics,
game theory, or network science. We discussed some of these techniques in
earlier chapters. We review them briefly here, and refer interested readers
to the respective chapters for more details.

• Threshold models (Chapter 8). When a behavior diffuses in a network,
such as the behavior of individuals buying a product and
referring it to others, one can use threshold models. In threshold
models, the parameters that need to be learned are the node activation
threshold $\theta_i$ and the influence probabilities $w_{ij}$. Consider the following
methodology for learning these values. Consider a merchandise
store where the store knows the connections between individuals and
their transaction history (e.g., the items that they have bought). Then,
$w_{ij}$ can be defined as the

fraction of times user i buys a product and
user j buys the same product soon after that.

The definition of “soon” requires clarification and can be set based on
a site’s preference and the average time between friends buying the
same product. Similarly, $\theta_i$ can be estimated by taking into account
the average number of friends who need to buy a product before user
i decides to buy it. Of course, this is only true when the products
bought by user i are also bought by her friends. When this is not the
case, methods from collaborative filtering (see Chapter 9) can be used
to find out the average number of similar items that are bought by
user i’s friends before user i decides to buy a product.

• Cascade Models (Chapter 7). Cascade models are examples of scenarios
where an innovation, product, or information cascades through
a network. The discussion with respect to cascade models is similar
to that of threshold models, with the exception that cascade models are
sender-centric. That is, the sender decides to activate the receiver,
whereas threshold models are receiver-centric, in which receivers
get activated by multiple senders. Therefore, the computation of the
independent cascade model (ICM) parameters needs to be done from the
sender’s point of view in cascade
models. Note that both threshold and cascade models are examples
of individual behavior modeling.
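
The estimate of $w_{ij}$ described for threshold models can be sketched as follows. The data layout (a dictionary from user to purchased products and purchase days), the function name influence_weight, and the seven-day meaning of “soon” are assumptions made for this illustration.

```python
def influence_weight(purchases, i, j, soon=7):
    """Fraction of user i's purchases that user j repeats within `soon` days.

    purchases maps user -> {product: day_of_purchase}.
    """
    bought_by_i = purchases.get(i, {})
    if not bought_by_i:
        return 0.0
    bought_by_j = purchases.get(j, {})
    followed = sum(
        1
        for product, day in bought_by_i.items()
        if product in bought_by_j and 0 < bought_by_j[product] - day <= soon
    )
    return followed / len(bought_by_i)

# Hypothetical transaction history for two connected users.
purchases = {
    "alice": {"book": 1, "lamp": 10, "phone": 20},
    "bob": {"book": 3, "lamp": 40, "mug": 5},
}
print(influence_weight(purchases, "alice", "bob"))  # only the book is followed soon
```

Because the count is taken over the purchases of user i, the estimate is not symmetric: here bob never follows alice's lead in the other direction.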


10.1.3 Individual Behavior Prediction 

As discussed previously, most behaviors result in newly formed links in 
social media. It can be a link to a user, as in befriending behavior; a link 
to an entity, as in buying behavior; or a link to a community, as in joining 
behavior. Hence, one can formulate many of these behaviors as a link 
prediction problem. Next, we discuss link prediction in social media. 


Link Prediction 

Link prediction assumes a graph $G(V, E)$. Let $e(u, v) \in E$ represent an
interaction (edge) between nodes u and v, and let $t(e)$ denote the time of
the interaction. Let $G[t_1, t_2]$ represent the subgraph of G such that all edges
are created between $t_1$ and $t_2$ (i.e., for all edges e in this subgraph,
$t_1 < t(e) < t_2$). Now given four time stamps $t_1 < t_1' < t_2 < t_2'$, a link prediction
algorithm is given the subgraph $G[t_1, t_1']$ (training interval) and is expected
to predict edges in $G[t_2, t_2']$ (testing interval). Note that, just like new edges,
new nodes can be introduced in social networks; therefore, $G[t_2, t_2']$ may
contain nodes not present in $G[t_1, t_1']$. Hence, a link prediction algorithm
is generally constrained to predict edges only for pairs of nodes that are
present during the training period. One can add extra constraints such
as predicting links only for nodes that are incident to at least k edges
(i.e., have degree greater than or equal to k) during both testing and training
intervals.

Let $G(V_{train}, E_{train})$ be our training graph. Then, a link prediction algorithm
generates a sorted list of most probable edges in $V_{train} \times V_{train} - E_{train}$.
The first edge in this list is the one the algorithm considers the most likely
to soon appear in the graph. The link prediction algorithm assigns a score
$\sigma(x, y)$ to every edge in $V_{train} \times V_{train} - E_{train}$. Edges sorted by this value
in decreasing order will create our ranked list of predictions. $\sigma(x, y)$ can be
predicted based on different techniques. Note that any similarity measure
between two nodes can be used for link prediction; therefore, methods discussed
in Chapter 3 are of practical use here. We outline some of the most
well-established techniques for computing $\sigma(x, y)$ here.
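
The setup above can be sketched in code. The representation of edges as (u, v, time) triples and the function names split_edges and candidate_pairs are assumptions for this illustration; the strict inequalities follow the definition of $G[t_1, t_2]$ given earlier.

```python
from itertools import combinations

def split_edges(timed_edges, t1, t1_end, t2, t2_end):
    """Return (training edges, testing edges) as sets of frozenset pairs.

    Testing edges are kept only if both endpoints appear in the training graph.
    """
    train = {frozenset((u, v)) for u, v, t in timed_edges if t1 < t < t1_end}
    train_nodes = set().union(*train) if train else set()
    test = {
        frozenset((u, v))
        for u, v, t in timed_edges
        if t2 < t < t2_end and u in train_nodes and v in train_nodes
    } - train
    return train, test

def candidate_pairs(train):
    """All node pairs of the training graph that are not yet connected."""
    nodes = sorted(set().union(*train)) if train else []
    return [frozenset(p) for p in combinations(nodes, 2) if frozenset(p) not in train]

# Hypothetical timestamped interactions.
timed_edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 6), ("c", "d", 7)]
train_edges, test_edges = split_edges(timed_edges, 0, 5, 5, 10)
print(candidate_pairs(train_edges))  # pairs that a predictor would have to score
```

The edge between c and d is dropped from the testing set because node d does not exist during the training interval.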

Node Neighborhood-Based Methods 

The following methods take advantage of neighborhood information to 
compute the similarity between two nodes. 

• Common Neighbors. In this method, one assumes that the more
common neighbors that two nodes share, the more similar they are.
Let $N(x)$ denote the set of neighbors of node x. This method is
formulated as

$\sigma(x, y) = |N(x) \cap N(y)|$.   (10.3)

• Jaccard Similarity. This commonly used measure calculates the likelihood
of a node that is a neighbor of either x or y to be a common
neighbor. It can be formulated as the number of common neighbors
divided by the total number of neighbors of either x or y:

$\sigma(x, y) = \frac{|N(x) \cap N(y)|}{|N(x) \cup N(y)|}$.   (10.4)

• Adamic and Adar Measure. A similar measure to Jaccard, this
measure was introduced by Lada Adamic and Eytan Adar [AA]. The
intuition behind the measure is that if two individuals share a neighbor
and that neighbor is a rare neighbor, it should have a higher impact
on their similarity. For instance, we can define the rareness of a node
based on its degree (i.e., the smaller the node’s degree, the higher
its rareness). The original version of the measure is defined based
on webpage features. A modified version based on neighborhood
information is

$\sigma(x, y) = \sum_{z \in N(x) \cap N(y)} \frac{1}{\log |N(z)|}$.   (10.5)

• Preferential Attachment. In the preferential attachment model discussed
in Chapter 4, one assumes that nodes of higher degree have
a higher chance of getting connected to incoming nodes. Therefore,
in terms of connection probability, higher degree nodes are similar.
The preferential attachment measure is defined to capture this
similarity:

$\sigma(x, y) = |N(x)| \cdot |N(y)|$.   (10.6)


Example 10.1. For the graph depicted in Figure 10.5, the similarity
between nodes 5 and 7 based on different neighborhood-based techniques
is

(Common Neighbor)  $\sigma(5, 7) = |\{4, 6\} \cap \{4\}| = 1$   (10.7)

(Jaccard)  $\sigma(5, 7) = \frac{|\{4, 6\} \cap \{4\}|}{|\{4, 6\} \cup \{4\}|} = \frac{1}{2}$   (10.8)

(Adamic and Adar)  $\sigma(5, 7) = \frac{1}{\log |\{5, 6, 7\}|} = \frac{1}{\log 3}$   (10.9)

(Preferential Attachment)  $\sigma(5, 7) = |\{4\}| \cdot |\{4, 6\}| = 1 \times 2 = 2$   (10.10)


Table 10.1. A Comparison between Link Prediction Methods

Method                     First edge                   Second edge                  Third edge
Common Neighbors           $\sigma(6, 7) = 1$           $\sigma(1, 3) = 1$           $\sigma(5, 7) = 1$
Jaccard Similarity         $\sigma(1, 3) = 1$           $\sigma(6, 7) = 1/2$         $\sigma(5, 7) = 1/2$
Adamic and Adar            $\sigma(1, 3) = 1/\log 2$    $\sigma(6, 7) = 1/\log 3$    $\sigma(5, 7) = 1/\log 3$
Preferential Attachment    $\sigma(2, 4) = 6$           $\sigma(2, 5) = 4$           $\sigma(2, 6) = 4$


In Figure 10.5, there are eight nodes; therefore, we can have a maximum
of $\binom{8}{2} = 28$ edges. We already have six edges in the graph; hence,
there are $28 - 6 = 22$ other edges that are not in the graph. For all these
edges, we can compute the similarity between their endpoints using the
aforementioned neighborhood-based techniques and identify the top three
most likely edges that are going to appear in the graph based on each technique.
Table 10.1 shows the top three edges based on each technique and
the corresponding values for each edge. As shown in this table, different
methods predict different edges to be most important; therefore, the method
of choice depends on the application.
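
The four neighborhood-based measures can be sketched as follows. The edge list below is an assumption: it is inferred from the neighbor sets and scores reported in Example 10.1 and Table 10.1 (eight nodes, six edges, node 8 taken to be isolated), since the figure itself is not reproduced here. The function names are likewise chosen for this illustration.

```python
from itertools import combinations
from math import log

# Edge list inferred from Example 10.1 and Table 10.1 (an assumption).
EDGES = [(1, 2), (2, 3), (4, 5), (4, 6), (4, 7), (5, 6)]
NODES = range(1, 9)  # eight nodes; node 8 is assumed to be isolated

N = {v: set() for v in NODES}
for u, v in EDGES:
    N[u].add(v)
    N[v].add(u)

def common_neighbors(x, y):
    return len(N[x] & N[y])

def jaccard(x, y):
    union = N[x] | N[y]
    return len(N[x] & N[y]) / len(union) if union else 0.0

def adamic_adar(x, y):
    # A shared neighbor z is adjacent to both x and y, so |N(z)| >= 2 and log > 0.
    return sum(1 / log(len(N[z])) for z in N[x] & N[y])

def preferential_attachment(x, y):
    return len(N[x]) * len(N[y])

def top_edges(score, k=3):
    """The k highest-scoring node pairs that are not already edges."""
    existing = {frozenset(e) for e in EDGES}
    missing = [p for p in combinations(NODES, 2) if frozenset(p) not in existing]
    return sorted(missing, key=lambda p: score(*p), reverse=True)[:k]

print(common_neighbors(5, 7), jaccard(5, 7), adamic_adar(5, 7),
      preferential_attachment(5, 7))
print(top_edges(preferential_attachment))
```

Pairs with equal scores are kept in enumeration order, so the ranking among ties (for example, under common neighbors) depends on that order rather than on the measure itself.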


Methods Based on Paths between Nodes 

Similarity between nodes can simply be computed from the shortest path 
distance between them. The closer the nodes are, the higher their similarity. 
This similarity measure can be extended by considering multiple paths 
between nodes and their neighbors. The following measures can be used to 
calculate similarity. 

• Katz measure. Similar to the Katz centrality defined in Chapter 3,
one can define the similarity between nodes x and y as

$\sigma(x, y) = \sum_{l=1}^{\infty} \beta^l \, |\mathrm{paths}_{x,y}^{\langle l \rangle}|$,   (10.11)

where $|\mathrm{paths}_{x,y}^{\langle l \rangle}|$ denotes the number of paths of length l between x
and y. $\beta$ is a constant that exponentially damps longer paths. Note
that a very small $\beta$ results in a common neighbor measure (see Exercises).
Similar to our finding in Chapter 3, one can find the Katz similarity
measure in a closed form by $(I - \beta A)^{-1} - I$. The Katz measure
can also be weighted or unweighted. In the unweighted format,
$|\mathrm{paths}_{x,y}^{\langle 1 \rangle}| = 1$ if there is an edge between x and y. The weighted version
is more suitable for multigraphs, where multiple edges can exist
between the same pair of nodes. For example, consider two authors x
and y who have collaborated c times. In this case, $|\mathrm{paths}_{x,y}^{\langle 1 \rangle}| = c$.

• Hitting and Commute Time. Consider a random walk that starts at
node x and moves to adjacent nodes uniformly. Hitting time $H_{x,y}$ is
the expected number of random walk steps needed to reach y starting
from x. This is a distance measure. In fact, a smaller hitting time
implies a higher similarity; therefore, a negation can turn it into a
similarity measure:

$\sigma(x, y) = -H_{x,y}$.   (10.12)

Note that if node y is highly connected to other nodes in the network
(i.e., has a high stationary probability $\pi_y$), then a random walk starting
from any x likely ends up visiting y early. Hence, all hitting times to y
are very short, and all nodes become similar to y. To account for this,
one can normalize hitting time by multiplying it with the stationary
probability $\pi_y$:

$\sigma(x, y) = -H_{x,y} \pi_y$.   (10.13)

Hitting time is not symmetric, and in general, $H_{x,y} \neq H_{y,x}$. Thus, one
can introduce the commute time to mitigate this issue:

$\sigma(x, y) = -(H_{x,y} + H_{y,x})$.   (10.14)

Similarly, commute time can also be normalized,

$\sigma(x, y) = -(H_{x,y} \pi_y + H_{y,x} \pi_x)$.   (10.15)


• Rooted PageRank. A modified version of the PageRank algorithm
can be used to measure similarity between two nodes x and y. In
rooted PageRank, we measure the stationary probability of y, $\pi_y$,
given the condition that during each random walk step, we jump to
x with probability $\alpha$ or a random neighbor with probability $1 - \alpha$.
The matrix format discussed in Chapter 3 can be used to solve this
problem.

• SimRank. One can define similarity between two nodes recursively
based on the similarity between their neighbors. In other words, similar
nodes have similar neighbors. SimRank performs the following:

$\sigma(x, y) = \gamma \cdot \frac{\sum_{x' \in N(x)} \sum_{y' \in N(y)} \sigma(x', y')}{|N(x)| \, |N(y)|}$,   (10.16)

where $\gamma$ is some value in range [0, 1]. We set $\sigma(x, x) = 1$, and by
finding the fixed point of this equation, we can find the similarity
between node x and node y.
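
Two of the path-based measures above can be sketched with plain adjacency matrices. The code below is an illustration under stated assumptions: the Katz sum is truncated at a maximum length instead of using the closed form, powers of the adjacency matrix are used to count the paths (these powers count walks, which may revisit nodes), and SimRank is approximated with a fixed number of iterations. The function names, parameter defaults, and the three-node example graph are hypothetical.

```python
def katz_scores(adj, beta=0.05, max_len=5):
    """Truncated Katz sum: sum over l = 1..max_len of beta^l * (A^l)."""
    n = len(adj)
    power = [row[:] for row in adj]                      # A^1
    total = [[beta * a for a in row] for row in adj]
    for l in range(2, max_len + 1):
        # A^l: entry (i, j) counts the walks of length l from i to j.
        power = [[sum(power[i][k] * adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                total[i][j] += beta ** l * power[i][j]
    return total

def simrank(adj, gamma=0.8, iterations=10):
    """Iterate the SimRank recursion toward its fixed point."""
    n = len(adj)
    nbrs = [[j for j in range(n) if adj[i][j]] for i in range(n)]
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        new = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        for x in range(n):
            for y in range(n):
                if x != y and nbrs[x] and nbrs[y]:
                    s = sum(sim[a][b] for a in nbrs[x] for b in nbrs[y])
                    new[x][y] = gamma * s / (len(nbrs[x]) * len(nbrs[y]))
        sim = new
    return sim

# Path graph 0 - 1 - 2: nodes 0 and 2 are not adjacent but share neighbor 1.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(katz_scores(path, beta=0.1, max_len=2)[0][2])
print(simrank(path)[0][2])
```

In the example, nodes 0 and 2 receive a Katz score only from the single path of length two, and their SimRank score equals gamma because their only neighbors coincide.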

After one of the aforementioned measures is selected, a list of the top
most similar pairs of nodes is selected. These pairs of nodes denote edges
predicted to be the most likely to soon appear in the network. Performance
(precision, recall, or accuracy) can be evaluated using the testing graph and
by counting the number of the testing graph’s edges that the link prediction
algorithm successfully reveals. Note that the performance is usually
very low, since many edges are created for reasons that cannot be observed
in a social network graph alone. So, a common baseline is to compare the
performance with random edge predictors and report the factor improvements
over random prediction.
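
The evaluation described above can be sketched as follows. The function names are hypothetical, and the random baseline assumes a predictor that picks pairs uniformly from the candidate (unconnected) pairs and that all testing edges are among those candidates.

```python
def precision(predicted, test_edges):
    """Fraction of predicted pairs that appear as edges in the testing graph."""
    predicted = [frozenset(p) for p in predicted]
    test_edges = {frozenset(e) for e in test_edges}
    if not predicted:
        return 0.0
    return sum(1 for p in predicted if p in test_edges) / len(predicted)

def improvement_over_random(predicted, test_edges, num_candidates):
    """Precision divided by the expected precision of a uniform random predictor."""
    test_edges = {frozenset(e) for e in test_edges}
    if not test_edges or num_candidates <= 0:
        raise ValueError("need at least one testing edge and one candidate pair")
    random_precision = len(test_edges) / num_candidates
    return precision(predicted, test_edges) / random_precision

# Hypothetical predictions and testing edges over 22 candidate pairs.
predicted = [(5, 7), (1, 3), (2, 8)]
new_edges = [(1, 3), (6, 7)]
print(precision(predicted, new_edges))
print(improvement_over_random(predicted, new_edges, num_candidates=22))
```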


10.2 Collective Behavior 

Collective behavior, first defined by sociologist Robert Park, refers to a 
population of individuals behaving in a similar way. This similar behavior 
can be planned and coordinated, but is often spontaneous and unplanned.
For instance, individuals stand in line for a new product release, rush into 
stores for a sale event, and post messages online to support their cause 
or show their support for an individual. These events, though formed by 
independent individuals, are observed as a collective behavior by outsiders. 


10.2.1 Collective Behavior Analysis 

Collective behavior analysis is often performed by analyzing individuals
performing the behavior. In other words, one can divide collective behavior
into many individual behaviors and analyze them independently. When
all these analyses are put together, however, the result is the expected
behavior of a large population. The user migration behavior we discuss in
this section is an example of this type of analysis of collective behavior.

One can also analyze the population as a whole. In this case, an individual’s
opinion or behavior is rarely important. In general, the approach
is the same as analyzing an individual, with the difference that the content
and links are now considered for a large community. For instance, if we
are analyzing 1,000 nodes, one can combine these nodes and edges into
one hyper-node, where the hyper-node is connected to all other nodes in
the graph to which its members are connected and has an internal structure
(subgraph) that details the interaction among its members. This approach
is unpopular for analyzing collective behavior because it does not consider
specific individuals and, at times, interactions within the population.
Interested readers can refer to the bibliographic notes for further references
that use this approach to analyze collective behavior. In contrast, this
approach is often considered when predicting collective behavior, which
is discussed later in this chapter.


User Migration in Social Media 

Users often migrate from one site to another for different reasons. The main 
rationale behind it is that users have to select some sites over others due to 
their limited time and resources. Moreover, social media’s networking often 
dictates that one cannot freely choose a site to join or stay. An individual’s 
decision is heavily influenced by his or her friends, and vice versa. Sites are 
often interested in keeping their users, because they are valuable assets that 
help contribute to their growth and generate revenue by increased traffic. 
There are two types of migration that take place in social media sites: site 
migration and attention migration. 

1. Site Migration. For any user who is a member of two sites $s_1$ and $s_2$
at time $t_i$, and is only a member of $s_2$ at time $t_j > t_i$, then the user is
said to have migrated from site $s_1$ to site $s_2$.

2. Attention Migration. For any user who is a member of two sites $s_1$
and $s_2$ and is active at both at time $t_i$, if the user becomes inactive on
$s_1$ and remains active on $s_2$ at time $t_j > t_i$, then the user’s attention
is said to have migrated away from site $s_1$ and toward site $s_2$.

Activity (or inactivity) of a user can be determined by observing the
user’s actions performed on the site. For instance, we can consider a user
active in interval $[t_i, t_i + \delta]$, if the user has performed at least one action on
the site during this interval. Otherwise, the user is considered inactive.
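
The definitions of activity and attention migration translate directly into a check over action timestamps. In the following sketch, the function names, the representation of a user's actions as a list of day numbers, and the interval length of 30 days are assumptions made for illustration.

```python
def is_active(action_times, start, delta):
    """True if at least one action falls inside [start, start + delta]."""
    return any(start <= t <= start + delta for t in action_times)

def attention_migrated(actions_s1, actions_s2, t_i, t_j, delta):
    """Attention moved from s1 to s2: active on both at t_i, only on s2 at t_j."""
    active_both_before = (is_active(actions_s1, t_i, delta)
                          and is_active(actions_s2, t_i, delta))
    left_s1 = not is_active(actions_s1, t_j, delta)
    stayed_s2 = is_active(actions_s2, t_j, delta)
    return active_both_before and left_s1 and stayed_s2

# Hypothetical action timestamps of one user on two sites, measured in days.
s1_actions = [2, 11, 25]
s2_actions = [5, 40, 75]
print(attention_migrated(s1_actions, s2_actions, t_i=0, t_j=60, delta=30))
```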

The interval $\delta$ could be measured at different granularity, such as days,
weeks, months, and years. It is common to set $\delta = 1$ month. To analyze
the migration of populations across sites, we can analyze migrations of
individuals and then measure the rate at which the population of these individuals
is migrating across sites. Since this method analyzes migrations at
the individual level, we can use the methodology outlined in Section 10.1.1
for individual behavior analysis as follows.


The Observable Behavior

Site migration is rarely observed since users often abandon their accounts 
rather than closing them. A more observable behavior is attention migration, 
which is clearly observable on most social media sites. Moreover, when a 
user commits site migration, it is often too late to perform preventive 
measures. However, when attention migration is detected, it is still possible 
to take actions to retain the user or expedite his or her attention migration 
to guarantee site migration. Thus, we focus on individuals whose attention 
has migrated. 

To observe attention migrations, several steps need to be taken. First,
users are required to be identified on multiple networks so that their activity
on multiple sites can be monitored simultaneously. For instance, username
huan.liu1 on Facebook is username liuhuan on Twitter. This identification
can be done by collecting information from sites where individuals list
their multiple identities on social media sites. On social networking sites
such as Google+ or Facebook, this happens regularly. The second step is
collecting multiple snapshots of social media sites. At least two snapshots
are required to observe migrations. After these two steps, we can observe
whether attention migrations have taken place or not. In other words, we
can observe if users have become inactive on one of the sites over time. Figure
10.6 depicts these migrations for some well-known social media sites.
In this figure, each radar chart shows migrations from a site to multiple
sites. Each target site is shown as a vertex, and the longer the spokes toward
that site, the larger the migrating population to it.


Features 

Three general features can be considered for user migration: (1) user activity
on one site, (2) user network size, and (3) user rank. User activity is
important, because we can conjecture that a more active user on one site is 
less likely to migrate. User network size is important, because a user with 
more social ties (i.e., friends) in a social network is less likely to move. 
Finally, user rank is important. The rank is the value of a user as perceived 
by others. A user with high status in a network is less likely to move to a 
new one where he or she must spend more time getting established. 

User activity can be measured differently for different sites. On Twitter,
it can be the number of tweets posted by the user; on Flickr, the number
of photos uploaded by the user; and on YouTube, the number of videos the user
has uploaded. One can normalize this value by its maximum in the site (e.g.,
the maximum number of videos any user has uploaded) to get an activity
measure in the range [0,1]. If a user is allowed to have multiple activities
on a site, as in posting comments and liking videos, then a linear combination
of these measures can be used to describe user activity on a
site.

Figure 10.6. Pairwise Attention Migration among Social Media Sites. Panels include
(a) Delicious, (b) Digg, (c) Flickr, (f) Twitter, and (g) YouTube.

User network size can be easily measured by taking the number of friends 
a user has on the site. It is common for social media sites to facilitate the 
addition of friends. The number of friends can be normalized in the range 
[0,1] by the maximum number of friends one can have on the site. 
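
The normalization of activity and network size can be sketched as follows. The user names, counts, and equal weights are hypothetical, and the sketch normalizes by the largest value observed among the users, which is an assumption where a site imposes a different maximum. With weights that sum to one, the combined activity score stays in the range [0,1].

```python
def normalize_by_max(values):
    """Scale non-negative per-user counts into [0, 1] by the largest observed count."""
    top = max(values.values())
    return {u: (v / top if top else 0.0) for u, v in values.items()}

def activity_score(measures, weights):
    """Linear combination of normalized activity measures for every user."""
    normalized = {name: normalize_by_max(vals) for name, vals in measures.items()}
    users = next(iter(measures.values())).keys()
    return {u: sum(weights[m] * normalized[m][u] for m in measures) for u in users}

# Hypothetical per-user counts on one site.
videos = {"ann": 10, "raj": 40, "li": 0}
comments = {"ann": 50, "raj": 25, "li": 5}
friends = {"ann": 120, "raj": 300, "li": 30}

activity = activity_score({"videos": videos, "comments": comments},
                          {"videos": 0.5, "comments": 0.5})
network_size = normalize_by_max(friends)
print(activity)
print(network_size)
```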

Finally, user rank is how important a user is on the site. Some sites
explicitly provide their users’ prestige rank list (e.g., top 100 bloggers),
whereas for others, one needs to approximate a user’s rank. One way of
approximating it is to count the number of citations (in-links) an individual
is receiving from others. A practical technique is to perform this
via web search engines. For instance, user test on StumbleUpon has
http://test.stumbleupon.com as his profile page. A Google search
for link:http://test.stumbleupon.com provides us with the number
of in-links to the profile on StumbleUpon and can be considered as a
ranking measure for user test.

These three features are correlated with the site attention migration
behavior and one expects changes in them when migrations happen.


Feature-Behavior Association 

Given two snapshots of a network, we know if users migrated or not. We
can also compute the values for the aforementioned features. Hence, we
can determine the correlation between features and migration behavior.

Let vector $Y \in \mathbb{R}^n$ indicate whether any of our n users have migrated or
not. Let $X_t \in \mathbb{R}^{3 \times n}$ be the features collected (activity, friends, rank) for any
one of these users at time stamp t. Then, the correlation between features
$X_t$ and labels $Y$ can be computed via logistic regression. How can we verify
that this correlation is not random? Next, we discuss how we verify that
this correlation is statistically significant.
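
A minimal sketch of this step is given below, assuming a plain gradient-ascent fit of logistic regression rather than any particular library routine. The six users, their feature values, the learning rate, and the number of epochs are hypothetical. Each row holds one user's (activity, friends, rank) values, i.e., the transpose of the $3 \times n$ layout above.

```python
from math import exp

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient ascent on the log-likelihood; returns [bias, w1, w2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)))
            err = label - p
            grad[0] += err
            for k, xi in enumerate(row, start=1):
                grad[k] += err * xi
        w = [wi + lr * g / len(X) for wi, g in zip(w, grad)]
    return w

# Hypothetical normalized features: (activity, friends, rank) per user.
X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.9], [0.7, 0.6, 0.8],
     [0.2, 0.3, 0.1], [0.1, 0.2, 0.3], [0.3, 0.1, 0.2]]
y = [0, 0, 0, 1, 1, 1]   # 1 = attention migrated away from the site

coefficients = fit_logistic(X, y)
print(coefficients)
```

In this toy data, migrating users have low activity, few friends, and low rank, so the fitted feature coefficients come out negative, which matches the conjectures made for the three features.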


Evaluation Strategy 

To verify if the correlation between features and the migration behavior is
not random, we can construct a random set of migrating users and compute
$X_{Random}$ and $Y_{Random}$ for them as well. This can be obtained by shuffling the
rows of the original $X_t$ and $Y$. Then, we perform logistic regression on these
new variables. This approach is very similar to the shuffle test presented in
Chapter 8. The idea is that if some behavior creates a change in features,
then other random behaviors should not create that drastic a change. So,
the observed correlation between features and the behavior should be significantly
different in both cases. The correlation can be described in terms
of logistic regression coefficients, and the significance can be measured
via any significance testing methodology. For instance, we can employ the
$\chi^2$-statistic,

$\chi^2 = \sum_{i=1}^{n} \frac{(A_i - R_i)^2}{R_i}$,   (10.17)

where n is the number of logistic regression coefficients, $A_i$’s are the coefficients
determined using the original dataset, and $R_i$’s are the coefficients
obtained from the random dataset.
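
The two ingredients of this evaluation can be sketched as follows, assuming the Pearson form of the statistic with the random-data coefficient in the denominator. The function names and coefficient values are hypothetical; the statistic becomes unstable when a random-data coefficient is close to zero, so the sketch is meant only to show the arithmetic.

```python
import random

def shuffled_labels(y, seed=0):
    """Copy of the labels in random order, which breaks the feature-label pairing."""
    shuffled = list(y)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def chi_square(original_coefs, random_coefs):
    """Sum of (A_i - R_i)^2 / R_i over the regression coefficients."""
    return sum((a - r) ** 2 / r for a, r in zip(original_coefs, random_coefs))

A = [2.0, 4.0]  # hypothetical coefficients fitted on the original data
R = [1.0, 2.0]  # hypothetical coefficients fitted on the shuffled data
print(chi_square(A, R))  # (2-1)^2/1 + (4-2)^2/2 = 3.0
print(shuffled_labels([0, 1, 1, 0, 1]))
```

In a full analysis, the regression would be refitted on the shuffled labels to obtain the $R_i$ values, and the resulting statistic would be compared against the chosen significance threshold.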


10.2.2 Collective Behavior Modeling 

Consider a hypothetical model that can simulate voters who cast ballots in 
elections. This effective model can help predict an election’s turnout rate 
as an outcome of the collective behavior of voting and help governments 
prepare logistics accordingly. This is an example of collective behavior 
modeling, which improves our understanding of the collective behaviors 
that take place by providing concrete explanations. 

Collective behavior can be conveniently modeled using some of the techniques
discussed in Chapter 4, “Network Models.” Similar to collective
behavior, in network models, we express models in terms of characteristics
observable in the population. For instance, when a power-law degree
distribution is required, the preferential attachment model is preferred, and
when the small average shortest path is desired, the small-world model is
the method of choice. In network models, node properties rarely play a role;
therefore, they are reasonable for modeling collective behavior.


10.2.3 Collective Behavior Prediction 

Collective behavior can be predicted using methods we discussed in Chapters
7 and 8. For instance, epidemic models can predict the effect of a disease on
a population and the behavior that the population will exhibit over time.
Similarly, implicit influence models such as the LIM model discussed in
Chapter 8 can estimate the influence of individuals based on collective
behavior attributes, such as the size of the population adopting an innovation
at any time.

As noted earlier, collective behavior can be analyzed either in terms of 
individuals performing the collective behavior or based on the population 
as a whole. When predicting collective behavior, it is more common to 
consider the population as a whole and aim to predict some phenomenon. 
This simplifies the challenges and reduces the computation dramatically, 
since the number of individuals who perform a collective behavior is often 
large and analyzing them one at a time is cumbersome. 

In general, when predicting collective behavior, we are interested in
predicting the intensity of a phenomenon, which is due to the collective
behavior of the population (e.g., how many of them will vote?). To
perform this prediction, we utilize a data mining approach where features
that describe the population well are used to predict a response variable (i.e.,
the intensity of the phenomenon). A training-testing framework or correlation
analysis is used to determine the generalization and the accuracy
of the predictions. We discuss this collective behavior prediction strategy
through the following example. This example demonstrates how the collective
behavior of individuals on social media can be utilized to predict
real-world outcomes.


Predicting Box Office Revenue for Movies 

Can we predict opening-weekend revenue for a movie from its prerelease 
chatter among fans? This tempting goal of predicting the future has been 
around for many years. The goal is to predict the collective behavior of 
watching a movie by a large population, which in turn determines the 
revenue for the movie. One can design a methodology to predict box office 
revenue for movies that uses Twitter and the aforementioned collective 
behavior prediction strategy. To summarize, the strategy is as follows: 

1. Set the target variable that is being predicted. In this case, it is the 
revenue that a movie produces. Note that the revenue is the direct 
result of the collective behavior of going to the theater to watch the 
movie. 

2. Determine the features in the population that may affect the target 
variable. 

3. Predict the target variable using a supervised learning approach, 
utilizing the features determined in step 2. 

4. Measure performance using supervised learning evaluation. 

One can use the population that is discussing the movie on Twitter before 
its release to predict its opening-weekend revenue. The target variable is 
the amount of revenue. In fact, utilizing only eight features, one can predict 
the revenue with high accuracy. These features are the average hourly 
number of tweets related to the movie for each of the seven days prior to 
the movie opening (seven features) and the number of opening theaters for 
the movie (one feature). Using only these eight features, training data for 
some movies (their seven-day tweet rates and the revenue), and a linear 
regression model, one can predict the movie opening-weekend revenue 
with high correlation. Researchers have shown (see Bibliographic
Notes) that the predictions of this approach are closer to reality than those
of the Hollywood Stock Exchange (HSX), which is the gold standard for
predicting revenues for movies.
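
The four-step strategy above can be sketched in a few lines of code. The
following is a minimal illustration only, not the original study's
implementation: the tweet rates, theater counts, and revenues are synthetic
numbers generated for the example, and the helper names (fit_linear,
predict) are hypothetical. It fits a linear regression on eight features
using a training set and measures the correlation between predicted and
actual revenue on held-out movies.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the weight vector."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    """Apply the learned weights (intercept first) to new feature rows."""
    return np.column_stack([np.ones(len(X)), X]) @ w

# Synthetic data (an assumption for illustration, not real box-office data).
rng = np.random.default_rng(0)
n_movies = 40
tweet_rates = rng.uniform(10, 500, size=(n_movies, 7))   # avg hourly tweets, 7 days
theaters = rng.uniform(500, 4000, size=(n_movies, 1))    # number of opening theaters
X = np.hstack([tweet_rates, theaters])                    # eight features per movie

# Step 1: target variable (revenue), here generated from a made-up linear rule.
made_up_w = np.array([0.01] * 7 + [0.005])
revenue = X @ made_up_w + rng.normal(0.0, 1.0, size=n_movies)

# Steps 2-3: features are fixed above; learn the model on the training movies.
train, test = slice(0, 30), slice(30, None)
w = fit_linear(X[train], revenue[train])

# Step 4: evaluate on held-out movies via the Pearson correlation.
pred = predict(w, X[test])
corr = np.corrcoef(pred, revenue[test])[0, 1]
print(f"held-out correlation: {corr:.3f}")
```

With real data, the feature matrix would be built from observed tweet
counts and theater counts, and the held-out correlation would indicate how
well the learned relation generalizes to unseen movies.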

This simple model for predicting movie revenue can be easily extended 
to other domains. For instance, assume we are planning to predict another 
collective behavior outcome, such as the number of individuals who aim to 
buy a product. In this case, the target variable y is the number of individuals 
who will buy the product. Similar to tweet rate, we require some feature A 
that denotes the attention the product is receiving. We also need to model 
the publicity of the product P. In our example, this was the number of 
theaters for the movie; for a product, it could represent the number of stores 
that sell it. A simple linear regression model can help learn the relation 
between these features and the target variable: 

$$y = w_1 A + w_2 P + \epsilon, \tag{10.18}$$

where $\epsilon$ is the regression error. Similar to our movie example, one attempts
to extract the values for $A$ and $P$ from social media.
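
As a sketch of how the weights in Equation 10.18 could be learned, one
option is ordinary least squares. The notation below is an assumption
introduced for illustration and does not appear in the original text: $m$
training products indexed by $i$, with observed attention $A_i$, publicity
$P_i$, and outcome $y_i$.

```latex
\mathbf{X} =
\begin{bmatrix}
A_1 & P_1 \\
\vdots & \vdots \\
A_m & P_m
\end{bmatrix},
\qquad
\mathbf{y} =
\begin{bmatrix}
y_1 \\ \vdots \\ y_m
\end{bmatrix},
\qquad
\hat{\mathbf{w}}
= \arg\min_{\mathbf{w}} \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2
= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}
```

Here $\hat{\mathbf{w}} = (\hat{w}_1, \hat{w}_2)^{\top}$, and the closed form
holds when $\mathbf{X}^{\top}\mathbf{X}$ is invertible. The prediction for a
new product with values $A$ and $P$ is then $\hat{w}_1 A + \hat{w}_2 P$.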


10.3 Summary 

Individuals exhibit different behaviors in social media, which can be
categorized into individual and collective behavior. Individual behavior is
the behavior that an individual targets toward (1) another individual
(individual-individual behavior), (2) an entity (individual-entity behavior),
or (3) a community (individual-community behavior). We discussed how to
analyze and predict individual behavior. To analyze individual behavior, there
is a four-step procedure, outlined as a guideline. First, the behavior observed 
should be clearly observable on social media. Second, one needs to design 
meaningful features that are correlated with the behavior taking place in 
social media. The third step aims to find correlations and relationships 
between features and the behavior. The final step is to verify the
relationships that are found. We discussed community joining as an example
of individual behavior. Modeling individual behavior can be performed via 
cascade or threshold models. Behaviors commonly result in interactions in 
the form of links; therefore, link prediction techniques are highly efficient 
in predicting behavior. We discussed neighborhood-based and path-based 
techniques for link prediction. 

Collective behavior occurs when a group of individuals, with or without
coordination, act in an aligned manner. Collective behavior is analyzed
either by analyzing individual behaviors and averaging the results or by
analyzing the population collectively. When analyzed collectively, one
commonly looks at the general patterns of the population. We discussed user
migrations in social media as an example of collective behavior analysis.
Modeling collective behavior can be performed via network models, and
prediction is possible by using population properties to predict an outcome.
Predicting movie box-office revenues was given as an example, which uses
population properties such as the rate at which individuals are tweeting to
demonstrate the effectiveness of this approach.

It is important to evaluate behavior analytics findings to ensure that these
findings are not due to externalities. We discussed causality testing,
randomization tests, and supervised learning evaluation techniques for evaluating
behavior analytics findings. However, depending on the context, researchers 
may need to devise other informative techniques to ensure the validity of 
the outcomes. 


10.4 Bibliographic Notes 

In addition to methods discussed in this chapter, game theory and
theories from economics can be used to analyze human behavior [Easley and
Kleinberg, 2010]. Community-joining behavior analysis was first
introduced by [Backstrom et al., 2006]. The approach discussed in this chapter is
a brief summary of their approach for analyzing community-joining
behavior. Among other individual behaviors, tie formation is analyzed in detail. In
[Wang et al., 2009], the authors analyze tie formation behavior on Facebook
and investigate how visual cues influence individuals with no prior
interaction to form ties. The features used are gender (i.e., male or female) and
visual conditions (attractive, nonattractive, and no photo). Their analyses
show that individuals have a tendency to connect to attractive opposite-sex
individuals when no other information is available. Analyzing individual
information-sharing behavior helps understand how individuals
disseminate information on social media. Gundecha et al. [2011] analyze how the
information-sharing behavior of individuals results in vulnerabilities and
how one can exploit such vulnerabilities to secure user privacy on a social
networking site. Finally, most social media mining research is dedicated to
analyzing a single site; however, users are often members of different sites,
and hence, current studies need to be generalized to cover multiple sites.
Zafarani and Liu [2009a, 2013] were the first to design methods that help
connect user identities across social media sites using behavioral modeling.
A study of user tagging behavior across sites is available in [Wang et al.,
2011].

General surveys on link prediction can be found in [Adamic and Adar,
2003; Liben-Nowell and Kleinberg, 2003; Al Hasan et al., 2006; Lü and
Zhou, 2011]. Individual behavior prediction is an active area of research.
Location prediction is an active area of individual behavior analysis that has
been widely studied over a long period in the realm of mobile computing.
Researchers analyze human mobility patterns to improve location prediction
services, thereby exploiting their potential power on various applications
such as mobile marketing [Barwise and Strong, 2002; Barnes and
Scornavacca, 2004], traffic planning [Ben-Akiva et al., 1998; Dia, 2001], and
even disaster relief [Gao et al., 2011a,b; Goodchild and Glennon, 2010;
Gao et al., 2012a; Wang and Huang, 2010; Barbier et al., 2012; Kumar
et al., 2013]. Other general references can be found in [Backstrom et al.,
2010; Monreale et al., 2009; Spaccapietra et al., 2008; Thanh and Phuong,
2007; Scellato et al., 2011; Gao et al., 2012b,c].

Kumar et al. [2011] first analyzed migration in social media. Other
collective behavior analyses can be found in Leskovec et al. [2009]. The
movie revenue prediction was first discussed by Asur and Huberman [2010].
Another example of collective behavior prediction can be found in the work
of O’Connor et al. [2010], which proposed using Twitter data for opinion
polls. Their results are highly correlated with Gallup opinion polls for
presidential job approval. Abbasi et al. [2012] analyzed collective social
media data and showed that by carefully selecting data from social media, it is
possible to use social media as a lens to analyze and even predict real-world
events.


10.5 Exercises 
Individual Behavior 

1. • Name five real-world behaviors that are commonly difficult to observe
in social media (e.g., your daily schedule or where you eat lunch are
rarely available in social media).

• Select one behavior that is most likely to leave traces online. Can you 
think of a methodology for identifying that behavior? 

2. Consider the “commenting under a blogpost” behavior in social media. 
Follow the four steps of behavior analysis to analyze this behavior. 

3. We emphasized selecting meaningful features for analyzing a behavior.
Discuss a methodology to verify whether the selected features carry enough
information with respect to the behavior being analyzed.

4. Correlation does not imply causality. Discuss how this fact relates to 
most of the datasets discussed in this chapter being temporal. 

5. Using a neighborhood-based link prediction method, compute the top
two most likely edges for the following figure.

6. Compute the most likely edge for the following figure for each
path-based link prediction technique.



7. In a link prediction problem, show that for small $\beta$, the Katz similarity
measure ($\sigma(u, v) = \sum_{l=1}^{\infty} \beta^{l} \, |\text{paths}_{u,v}^{\langle l \rangle}|$) becomes Common neighbors
($\sigma(u, v) = |N(u) \cap N(v)|$).

8. Provide the matrix format for rooted PageRank and SimRank
techniques.


Collective Behavior 

9. Recent research has shown that social media can help replicate survey
results for elections and ultimately predict presidential election
outcomes. Discuss what possible features can help predict a presidential
election.

















Notes 


Chapter 1 

1. The data has a power-law distribution and more often than not, data is not
independent and identically distributed (i.i.d.) as generally assumed in data mining.

Chapter 2 

1. This is similar to plotting the probability mass function for degrees. 

2. Instead of W in weighted networks, C is used to clearly represent capacities. 

3. This edge is often called the weak link. 

4. The proof is omitted here and is a direct result from the minimum-cut/maximum 
flow theorem not discussed in this chapter. 

Chapter 3 

1. This constraint is optional and can be lifted based on the context. 

2. When $\det(I - \alpha A^T) = 0$, it can be rearranged as $\det(A^T - \alpha^{-1} I) = 0$, which is
basically the characteristic equation. This equation first becomes zero when the
largest eigenvalue equals $\alpha^{-1}$, or equivalently $\alpha = 1/\lambda$.

3. When $d_j^{out} = 0$, we know that since the out-degree is zero, $\forall i, A_{j,i} = 0$. This makes
the term inside the summation $\frac{0}{0}$. We can fix this problem by setting $d_j^{out} = 1$ since
the node will not contribute any centrality to any other nodes.

4. Here, we start from $v_1$ and follow the edges. One can start from a different node,
and the result should remain the same.

5. HITS stands for hypertext-induced topic search. 

Chapter 4 

1. For a more detailed approach refer to [Clauset et al., 2009].

2. Note that for $c = 1$, the component size is stable, and in the limit, no growth will
be observed. The phase transition happens exactly at $c = 1$.

3. Hint: The proof is similar to the proof provided for the likelihood of observing m 
edges (Proposition 4.3). 

Chapter 5 

1. See [Zafarani, Cole, and Liu, 2010] for a repository of network data. 

2. One can use all unique words in all documents ( D ) or a more frequent subset of 
words in the documents for vectorization. 

3. Note that in our example, the class attribute can take two values; therefore, the
initial guess of $P(y_i = 1 | N(v_i)) = \frac{1}{2} = 0.5$ is reasonable. When a class attribute
takes $n$ values, we can set our initial guess to $P(y_i = 1 | N(v_i)) = \frac{1}{n}$.

Chapter 6 

1. For more details refer to [Chung, 1996]. 

2. See [Kossinets and Watts, 2006] for details. 

3. Let $X$ be the solution to spectral clustering. Consider an orthogonal matrix
$Q$ (i.e., $QQ^T = I$). Let $Y = XQ$. In spectral clustering, we are
maximizing $Tr(X^T L X) = Tr(X^T L X Q Q^T) = Tr(Q^T X^T L X Q) = Tr((XQ)^T L (XQ)) =
Tr(Y^T L Y)$. In other words, $Y$ is another answer to our trace-maximization
problem. This proves that the solution $X$ to spectral clustering is non-unique under
orthogonal transformations $Q$.

4. http://www.mturk.com. 

Chapter 7 

1. This assumption can be lifted [Kempe et al., 2003].

2. See [Gruhl et al., 2004] for an application in the blogosphere.

3. Formally, assuming $P \neq NP$, there is no polynomial time algorithm for this
problem.

4. The internal-influence model is similar to the SI model discussed later in the section 
on epidemics. For the sake of completeness, we provide solutions to both. Readers 
are encouraged to refer to that model in Section 7.4 for further insight. 

5. A generalization of these techniques over networks can be found in [Hethcote et al.,
1981; Hethcote, 2000; Newman, 2010].

Chapter 8 

1. From ADD health data: http://www.cpc.unc.edu/projects/addhealth. 

2. The directed case is left to the reader. 

3. As defined by the Merriam-Webster dictionary. 

4. In the original paper, the authors utilize a weight function instead. Here, for clarity, 
we use coefficients for all parameters. 

5. Note that Equation 8.28 is defined recursively, because $I(p)$ depends on
InfluenceFlow and that, in turn, depends on $I(p)$ (Equation 8.27). Therefore, to
estimate $I(p)$, we can use iterative methods where we start with an initial value for
$I(p)$ and compute until convergence.

6. Note that we have assumed that homophily is the leading social force in the 
network that leads to its assortativity change. This assumption is often strong for 
social networks because other social forces act in these networks. 
7. In the original paper, instead of $a$, the authors use $\ln(a + 1)$ as the variable. This
helps remove the effect of a power-law distribution in the number of activated
friends. Here, for simplicity, we use the nonlogarithmic form.

8. Note that maximizing this term is equivalent to maximizing the logarithm; this is 
where Equation 8.41 comes into play. 

Chapter 9 

1. In Matlab, this can be performed using the svd command. 